Avnet Electronics Marketing introduces a new global series of Xilinx®
Zynq SpeedWay Design Workshops™ for designers of electronic applications
based on the Xilinx Zynq®-7000 All Programmable (AP) SoC Architecture.
Taught by Avnet technical experts, these one-day workshops combine 
informative presentations with hands-on labs, featuring the ZedBoard™ 
and MicroZed™ development platforms. Don’t miss this opportunity 
to gain hands-on experience with development tools and design 
techniques that can accelerate development of your next design. 


© Avnet, Inc. 2014. All rights reserved. AVNET is a registered trademark of Avnet, Inc. 
Xilinx, Zynq and Vivado are trademarks or registered trademarks of Xilinx, Inc.


Fast, Simple Analog-to-Digital 
Converter to Xilinx FPGA 
Connection with JESD204B 


Analog Devices’ AD-FMCJESDADC1-EBZ FMC board connected to Xilinx Zynq ZC706 
displaying eye diagram and output spectrum on native Linux application via HDMI. 


Analog Devices’ newest Xilinx FPGA development platform-compatible FPGA 
mezzanine card (FMC), the AD-FMCJESDADC1-EBZ Rapid Development Board, 
is a seamless prototyping solution for high performance analog to FPGA conversion. 


• Rapidly connect and prototype high-speed analog-to-digital conversion to FPGA platforms
• JEDEC JESD204B SerDes (serializer/deserializer) technology
• Four 14-bit analog-to-digital conversion channels at 250 MSPS (two AD9250 ADC ICs)
• Free eye diagram analyzer software included in complete downloadable software and documentation package
• HDL and Linux software drivers fully tested and supported on ZC706 and other Xilinx boards.

Learn more and purchase at analog.com/AD9250-FMC-ADC
Xilinx Expands Generation-Ahead
Lead to Era of UltraScale 


Happy New Year! As we move into 2014, I find myself very excited about Xilinx’s
prospects and predict that this will be a banner year for the company and our
customers. I have been an eyewitness to the tremendous effort Xilinx has made
over the last six years under the leadership of CEO Moshe Gavrielov. In that time, among 
many accomplishments, Xilinx launched three generations of products and developed an 
entirely new, best-in-class design tool suite. What’s more, by launching our 7 series All 
Programmable devices a few years ago and being first to ship 20-nanometer devices in 
December 2013, Xilinx is growing from the FPGA technology leader to a leader in semi- 
conductor and system innovations. 

The 7 series, to me, marks the beginning of Xilinx’s breakout, as the company not only 
delivered the best FPGAs to the market at 28 nm, but also innovated two entirely new class- 
es of devices: our 3D ICs and Zynq SoCs. 

Meanwhile, our tools group developed from scratch the Vivado® Design Suite and the 
UltraFast™ Design Methodology to help customers get innovations to market sooner. All 
these advancements in the 7 series are remarkable in and of themselves, but what’s more 
impressive is that they set a solid road map for future prosperity for Xilinx and a path toward 
innovation for our customers. It has been deeply rewarding to see this success becoming evi- 
dent and undeniable to the outside world over the last few quarters, as Xilinx now has more 
than 70 percent of total 28-nm FPGA market share with our 7 series devices. These numbers 
truly put wood behind the marketing arrow that Xilinx is a Generation Ahead. 

In my days as a reporter, I often witnessed the back-and-forth bravado in this industry 
between Xilinx and its competition. There was so much of it and with so little concrete proof 
on either side that it became white noise—like listening to two kids bickering in the back- 
seat over whatever. 

Of course, today I’m biased because I work here, but I predict that if Xilinx’s momentum 
and Generation Ahead lead weren’t apparent to you in 2013, they will be in 2014. If you look 
at the numbers and Xilinx’s first-to-market delivery of products over the last two generations, 
it is evident that Xilinx is not content to sit still and wait for the competition to try to leapfrog 
us. As you will read in the cover story, late last year Xilinx delivered the industry’s first 20-nm 
FPGAs to customers months ahead of the competition (which has yet to ship 20-nm prod- 
ucts) and unveiled the portfolios for the 20-nm Kintex® and Virtex® UltraScale™ devices. 
Included in the UltraScale offering is a Virtex FPGA with a capacity four times greater than 
the competition’s largest device, continuing Xilinx’s undisputed density leadership. 

The new UltraScale families are ASIC-class devices that bring incredible value to cus- 
tomers who today are developing the innovations that will shape tomorrow—not just 
those a year or more from now, but products that will define the future of electronics. 
What’s even more exciting is that these launches represent just the beginning of the 
UltraScale era at Xilinx. 
HAPS Developer eXpress Solution


Pre-integrated hardware and software for fast prototyping of complex IP systems 
Designs come in all sizes. Choose a prototyping system that does too.
HAPS-DX, an extension of Synopsys’ HAPS-70 FPGA-based prototyping 
product line, speeds prototype bring-up and streamlines the integration of 
IP blocks into an SoC prototype. 


To learn more about Synopsys FPGA-based prototyping systems, 
visit www.synopsys.com/haps 
Accelerating Innovation


TOOLS OF XCELLENCE 


On Nov. 11 of 2013, Xilinx accomplished a key mile- 
stone in building on its Generation Ahead advan- 
tage by shipping to customers the industry’s first 
20-nanometer All Programmable FPGA—a Kintex® 
UltraScale™ XCKU040 device—months ahead 
of when the competition was purporting to ship 
its first 20-nm devices. Then, in December, Xilinx 
built on that accomplishment by unveiling its en- 
tire 20-nm Kintex UltraScale and Virtex® UltraScale 
portfolios, the latter of which includes the Virtex 
UltraScale VU440. Based on 3D IC technology, this 
4.4 million-logic-cell device breaks world records 
Xilinx already held for the highest-capacity FPGA 
and largest semiconductor transistor count with its 
28-nm Virtex-7 2000T. 

“We are open for business at 20 nm and are ship- 
ping the first of our many 20-nm UltraScale devices,” 
said Steve Glaser, senior vice president of corporate 
strategy and marketing at Xilinx. “With the 20-nm 
UltraScale, we are expanding the market leadership 
that we firmly established with our tremendously 
successful 28-nm, 7 series All Programmable devic- 
es. Today, we are not only delivering to customers 
the first 20-nm devices manufactured with TSMC’s 
20SoC process months ahead of the competition, 
but are also delivering devices that leverage the in- 
dustry’s most advanced silicon architecture as well 
as an ASIC-strength design suite and methodology.”

All the devices in the 20-nm Kintex UltraScale and
Virtex UltraScale portfolios feature ASIC-class
performance and functionality, along with lower power
and higher capacity than their counterparts in
Xilinx’s tremendously successful 7 series portfolio
(Figure 1). The 7 series devices, built at the 28-nm silicon
manufacturing node, currently control more than 70
percent of programmable logic device
industry market share. What’s more,
with UltraScale, Xilinx has taken additional
steps to refine the device architecture
and make ASIC-class improvements
to its Vivado® Design Suite. Last October
the company introduced a streamlined
methodology called the UltraFast Design
Methodology (detailed in the cover
story of Xcell Journal issue 85;
http://issuu.com/xcelljournal/docs/xcell_journal_issue_85/8?e=2232228/5349345).
Xilinx will follow up its 20-nm UltraScale
portfolio with 16-nm FinFET UltraScale
devices in what it calls its “multinode
strategy” (Figure 2).

Figure 1 - The 20-nm Kintex and Virtex UltraScale FPGAs deliver industry-leading features complementary to the Kintex and
Virtex 7 series devices (max counts listed).

“We are in a multinode world now
where customers will be designing
in devices from either our 7 series,
our UltraScale 20-nm or our upcoming
UltraScale 16-nm FinFET family depending
on what is the best fit for their system
requirements,” said Kirk Saban, product
line marketing manager at Xilinx. “For
example, if you compare our 20-nm
Kintex UltraScale devices with our Kintex-7
devices, we have significantly more logic 
and DSP capability in Kintex UltraScale 
than we had in Kintex-7. This is because 
a vast majority of applications needing 
high signal-processing bandwidth today 
tend to demand Kintex-class price points 
and densities. If the customer’s applica- 
tion doesn’t require that additional den- 
sity or DSP capability, Kintex-7 is still a 
very viable design-in vehicle for them.” 

Saban said that the multinode ap- 
proach gives Xilinx’s broad user base 
the most powerful selection of All Pro- 
grammable devices available in the in- 
dustry, while the Vivado Design Suite 
and UltraFast methodology provide un- 
matched productivity. 

“There’s a misconception in the in- 
dustry that Xilinx is going to be leap- 
frogged by the competition, which is 
waiting for Intel to complete its 14-nm 
FinFET silicon design process and then 
fabricate the competition’s next-generation
devices,” said Saban. “We are
certainly not sitting still. We are already
offering our 20-nm UltraScale devices, 
which allow customers to create inno- 
vations today. We will be offering our 
UltraScale FinFET devices in the same 
time frame as the competition. We are 
going to build on our Generation Ahead 
advantage with these devices and the 
ASIC-class advantage of our Vivado De- 
sign Suite and UltraFast methodology.” 

These advances are built on a foun- 
dation of solid manufacturing, Saban 
said. “We are confident we have the 
industry’s strongest foundry partner 
in TSMC, which has a proven track 
record for delivery and reliability,” he 
said. “Foundry is TSMC’s primary busi- 
ness, and they manufacture devices 
for the vast majority of the who’s who 
in the semiconductor industry. What’s 
more, TSMC’s former CTO, now advis- 
er, Chenming Hu actually pioneered the 
FinFET process, and we are very im- 
pressed with their FinFET development 
for next-gen processes.” 
KINTEX AND VIRTEX ULTRASCALE
TRANSISTOR COUNTS 

Each silicon process node presents the 
industry with a new set of manufactur- 
ing and design challenges. The 20-nm 
node is no exception. This geometry 
introduced new challenges in terms of 
routing delays, clock skew and CLB 
packing. However, with the Kintex 
UltraScale and Virtex UltraScale devic- 
es, Xilinx was able to overcome these 
challenges and greatly improve overall 
performance and utilization rates. (See 
the sidebar, “What’s the Right Road to 
ASIC-Class Status for FPGAs?”) 

The intricacies of the node also al- 
lowed Xilinx to make several block-lev- 
el improvements to the architecture, 
all of which Xilinx co-optimized with 
its Vivado tool suite to deliver maximum
bandwidth and maximum signal-processing
capabilities.

“If we have a closer look at our DSP 
innovations, we've gone to a wider-input 
multiplier, which allows us to use fewer 
blocks per function and deliver higher 
precision for any type of DSP applica- 
tion,” said Saban. “We also included some 
additional features for our wireless com- 
munications customers in terms of FEC, 
ECC and CRC implementations now be- 
ing possible within the DSP48 itself.” 

On the Block RAM front, Xilinx hard- 
ened the data cascade outputs and im- 
proved not only the power, but also the 
performance of BRAM with some new, 
innovative hardened features. 

Xilinx is offering two different kinds 
of transceivers in the 20-nm Kintex and 
Virtex UltraScale portfolios. Mid- and
high-speed-grade Kintex UltraScale devices
will support 16.3-Gbps backplane
operation. Even the slowest-speed- 
grade devices in the Kintex UltraScale 
family will offer impressive transceiver 
performance at 12.5 Gbps, which is par- 
ticularly important for wireless applica- 
tions. Meanwhile, in its Virtex UltraScale 
products, Xilinx will offer a transceiver 
capable of 28-Gbps backplane opera- 
tion, as well as 33-Gbps chip-to-chip and 
chip-to-optics interfacing. 

“We've added some significant inte- 
grated hard IP blocks into UltraScale,” 
said Saban. “We’ve added a 100-Gbps Eth- 
ernet MAC as hard IP to both the Virtex 
and Kintex UltraScale family devices. We 
also added to both of these UltraScale 
portfolios hardened 150-Gbps Inter- 
laken interfaces and hardened PCI Ex- 


[Figure 2 labels: FPGA 45/40nm. Future: Concurrent Nodes with FPGAs, SoCs and 3D ICs. 28nm: Long life with optimal price/performance/watt and SoC integrations. 16nm: Complements 20nm with FinFET, multiprocessing, memory.]

Figure 2 - Xilinx’s Generation Ahead strategy favors multinode product development, with the concurrent release of FPGA,
SoC and 3D IC product lines on nodes best suited for customer requirements.
What’s the Right Road to
ASIC-Class Status for FPGAs?

by Steve Leibson
Editor
Xcell Daily Blog
Xilinx, Inc.

Something that happened a while ago to ASICs has
now hit FPGAs. What is it? It’s the dominance of
routing delay in determining design performance.
Over the years, Dennard scaling increased
transistor speed while Moore’s Law scaling increased
transistor density per square millimeter. Unfortunately, it
works the other way for interconnect. As wires become
thinner and flatter with Moore’s Law scaling, they get
slower. Eventually, transistor delay shrinks to insignificance and
routing delay dominates. With the increasing density of
FPGAs and with the entry of Xilinx® UltraScale™ All
Programmable devices into the realm of ASIC-class design, the same
problem has appeared. UltraScale devices have been
re-engineered to overcome this problem, but the solution wasn’t
easy and it wasn’t simple. Here’s what it took.


STEP 1: COMPACT THE BLOCKS SO THAT
SIGNALS DON’T NEED TO TRAVEL AS FAR.

Sounds obvious, right? Necessity is the mother of
invention and at UltraScale densities, it was time to act. The
CLBs in the UltraScale architecture have been reworked
so that the Vivado® Design Suite can pack a logic
design into the CLBs more efficiently. Logic-block designs
become more tightly packed and less inter-CLB routing
resource is required as a result. The routing paths also 
become shorter. Changes within the UltraScale architec- 
ture’s CLBs include adding dedicated inputs and outputs 
to every flip-flop within the CLB (so that the flip-flops 
can be used independently for greater utilization); adding 
more flip-flop clock enables; and adding separate clocks 
to the CLBs’ shift registers and distributed RAM compo- 
nents. Conceptually, the improved CLB utilization and 
packing looks like the diagram in Figure 1. 

The example illustration shows that a circuit block for- 
merly implemented with 16 CLBs now fits in nine of the 
improved UltraScale CLBs. The distribution of the small 
blue squares and triangles in the illustration shows that the 
CLB utilization has improved and the reduction in red lines 
shows that routing requirements are reduced as well. 


[Figure 1 panels: Suboptimal CLB Packing; Optimal CLB Packing]

Figure 1 - CLB utilization is improved and routing requirements reduced in the UltraScale architecture.


STEP 2: ADD MORE ROUTING RESOURCES.

A rise in transistor density from Moore’s Law scaling
causes the number of CLBs to increase proportional to
N², where N is the linear scaling factor of the IC process
technology. Unfortunately, FPGA routing resources tend
to scale linearly with N, far more slowly. That’s a
situation with rapidly diminishing returns unless you take
action and solve the problem. For the UltraScale
architecture, the solution involved adding more local routing
resources so that the routability improves more quickly
with increasing CLB density. Figure 2 shows the result.

However, it’s not sufficient to simply increase the 
hardware-routing resources. You must also enhance the 
design tool’s place-and-route algorithms so that they 
will employ these new resources. The Xilinx Vivado De- 
sign Suite has been upgraded accordingly. 
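
The scaling mismatch described in Step 2 can be made concrete with a few lines of arithmetic. The sketch below is illustrative only: it assumes CLB count grows exactly as N² and legacy routing tracks exactly as N, as the text states, and the function names and sample values of N are hypothetical rather than taken from any device data.

```python
def clbs(n: float) -> float:
    """Relative CLB count: grows with the square of the linear scaling factor."""
    return n ** 2


def routing_tracks(n: float) -> float:
    """Relative routing-track count under the older, linearly scaling scheme."""
    return n


def clbs_per_track(n: float) -> float:
    """How many CLBs each routing track must serve, relative to n = 1."""
    return clbs(n) / routing_tracks(n)


if __name__ == "__main__":
    # Assumed sample scaling factors; each doubling of N doubles the load per track.
    for n in (1, 2, 4, 8):
        print(f"N={n}: CLBs x{clbs(n):g}, tracks x{routing_tracks(n):g}, "
              f"CLBs per track x{clbs_per_track(n):g}")
```

Under these assumptions the load on each routing track grows in direct proportion to N, which is the divergence that the added local routing resources are meant to counter.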


STEP 3: DEAL WITH INCREASING CLOCK SKEW. 

You might not know this, but FPGA clocking has been pret- 
ty simplistic because it could be. Earlier FPGA generations 
relied on a central clock-distribution spine, which would 
then fan out from the IC’s geometric center to provide 
clocks to all of the on-chip logic. That sort of global clock- 
ing scheme just isn’t going to work in ASIC-class FPGAs 
like the ones you find in the Virtex UltraScale and Kintex 
UltraScale All Programmable device families. Rising CLB 
densities and increasing clock rates won’t permit it. Con- 
sequently, UltraScale devices employ a radically improved 
clocking scheme as seen in Figure 3. 


[Figure 2 labels: Logic Elements; Effect of routing resources; Interconnect tracks O(N)]

Figure 2 - The blue line shows exponential CLB growth
with increasing transistor density. The straight red line shows
the slower, linear growth of inter-CLB interconnect using
previous-generation routing resources. Note that the straight
red line is rapidly diverging from the blue curve. The red curve
shows the improved routability of the enhanced local inter-CLB
interconnect scheme used in the UltraScale architecture.

[Figure 3 labels: Clock Domain; Clock Domain 2]

Figure 3 - A radical new clocking scheme can place
multiple clock-distribution nodes at the geometric center
of many on-chip clock domains.


The UltraScale architecture’s clock-distribution network 
consists of a regionalized, segmented clocking infrastruc- 
ture that can place multiple clock-distribution nodes at the 
geometric centers of many on-chip clock domains. The indi- 
vidual clock-distribution nodes then drive individual clock 
trees built from properly sized infrastructure segments. 
There are at least three big benefits to this approach: 


1. Clock skew quickly shrinks. 
2. There’s a lot more clocking resource to go around. 
3. Timing closure immediately gets easier. 
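
A toy geometric model hints at why regional clock roots reduce skew. The sketch below is not a description of the actual UltraScale clock network: it assumes skew loosely tracks the worst-case Manhattan distance from a clock root to the farthest tile it drives, and the grid size and region counts are hypothetical.

```python
def worst_case_distance(width: int, height: int,
                        regions_x: int = 1, regions_y: int = 1) -> float:
    """Worst-case Manhattan distance (in tiles) from a clock root placed at the
    center of its region to the farthest corner of that region."""
    region_w = width / regions_x
    region_h = height / regions_y
    return region_w / 2 + region_h / 2


if __name__ == "__main__":
    # Hypothetical 120 x 120 tile fabric.
    print("single central spine:", worst_case_distance(120, 120))
    print("4 x 4 regional roots:", worst_case_distance(120, 120, 4, 4))
```

In this simplified picture, splitting the fabric into a 4 x 4 grid of clock regions cuts the longest root-to-tile path by a factor of four, which is the intuition behind the first benefit listed above.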


However, it’s not sufficient to improve the clocking in- 
frastructure unless the design tools support the new clock- 
ing scheme, so the Vivado Design Suite has been upgraded 
accordingly, as it was for the improved inter-CLB routing 
discussed above in Step 2. 

For each of these three steps, Xilinx had to make big chang- 
es in both the hardware architecture and the design tools. 
That’s what Xilinx means when it says that the UltraScale ar- 
chitecture and the Vivado Design Suite were co-optimized. It 
required a significant effort—which was absolutely mandato- 
ry to deliver an ASIC-class All Programmable device portfolio. 

For more information, see the white paper “Xilinx
UltraScale Architecture for High-Performance, Smarter
Systems” (http://www.xilinx.com/support/documentation/white_papers/wp434-ultrascale-smarter-systems.pdf).
press® Gen3 blocks that are capable of
operation all the way up to Gen3x8.” 

Both UltraScale families support 
DDR4 memory, which offers 40 percent 
higher data rates than what is available 
in 7 series devices while delivering a 20 
percent reduction in overall power con- 
sumption in terms of memory interfaces. 

On the security front, Saban said Xil- 
inx has added features to provide great- 
er key protection, along with the ability 
to implement more detailed and sophis- 
ticated authentication schemes. He add- 
ed that the Vivado tool suite supports all 
of these new enhancements. 

Saban noted that the Virtex and Kin- 
tex UltraScale devices also share the 
same fabric performance. “The Kintex 
UltraScale is really a ‘midrange FPGA’ 
only in terms of its density, not in terms 
of its performance,” he said. 

Last but not least, the 20-nm UltraScale 
builds on Xilinx’s outstandingly successful
3D IC technology, which it pioneered
in the 7 series. With the second
generation of Xilinx’s stacked-silicon 
interconnect (SSI) technology, the com- 
pany made significant enhancements to 
improve the interdie bandwidth across 
the multidice technology. “SSI enables 
monolithic devices to be realized with 
multiple dice all acting as one within a 
very large device,” Saban said. 

Xilinx was able to leverage this SSI 
technology to once again break its own 
world records for the largest-capacity 
FPGA and highest IC transistor count. 


WORLD-RECORD CAPACITY 

Certainly, a shining star in the 20- 
nm UltraScale lineup is the Virtex 
UltraScale XCVU440. Xilinx imple- 
mented the device with its award-win- 
ning 3D stacked-silicon interconnect 
technology, which stacks several dice 
side-by-side on a silicon interposer to 
which each die is connected. The resulting
methodology allowed Xilinx at the
28-nm node to offer “More than Moore’s
Law” capacity and establish transistor-count
and FPGA logic-cell capacity
world records with the Virtex-7 2000T,
which has 1,954,560 logic cells. With the
Virtex UltraScale XCVU440, Xilinx is
smashing its own record by offering a
20-nm device with 4.4 million logic cells
(the equivalent to 50 million ASIC gates)
for programming. The device also is by
far the world’s densest IC, containing
more than 20 billion transistors.

“We anticipate that devices at this
level of capacity will be a perfect fit for
ASIC, ASSP and system emulation and
prototyping,” said Saban. Some vendors
specialize in creating massive commercial
boards for ASIC prototyping, but
many more companies build their own
prototyping systems. A vast majority of
them are looking for the highest-capacity
FPGAs for their prototypes. With the
Virtex-7 2000T, Xilinx was the hands-down
leader in this segment at the 28-nm
node. “I can tell you that prototyping
customers are more than pleased
at what we are offering with the Virtex
UltraScale XCVU440,” said Saban.

[Figure 3 application labels: 400G OTN Switching; 400G Transponder; 2x100G Muxponder; ASIC Prototyping; 400G MAC-to-Interlaken Bridge; 4X4 Mixed-Mode Radio; 100G Traffic Manager NIC; Super-High Vision Processing; 256-Channel UltraSound; 48-Channel T/R Radar Processing]

Figure 3 - Xilinx UltraScale devices are ideally suited for the next-generation innovations of smarter systems.


ULTRASCALE TO 400G WITH CFP4 
Saban noted that Xilinx diligently 
planned the 20-nm UltraScale portfolio 
to offer next-generation capabilities for 
its entire user base, so as to enable them 
to create the next generation of smarter 
systems (Figure 3). The feature sets of 
the 20-nm Kintex and Virtex lines are 
especially well suited for key applica- 
tions in networking, the data center and 
wireless communications. 

Today the networking space is see- 
ing a tremendous buildout of 100G ap- 
plications. Saban said that leading-edge 
systems are already employing 100G 
technologies, and the technology is fast 
becoming mainstream and expanding 
to peripheral markets connecting de- 
vices to 100G networks. At the same 
time, the leading networking custom- 
ers are already deep into development 
of next-generation 400G and terabit 
equipment. The new UltraScale devices 
are well suited to help those develop- 
ing 100G solutions and those moving to 
more advanced 400G technologies. 

Xilinx’s first-generation SSI technol- 
ogy allowed the company to deliver the 
award-winning Virtex-7 H580T, which 
customers were able to leverage to 
create a 2x100G transponder on-chip 
for networks employing CFP2 optical 
modules (see cover story, Xcell Journal
issue 80; http://issuu.com/xcelljournal/docs/xcell80/8?e=2232228/2002872).
Now, with second-generation SSI tech- 
nology, Xilinx is enabling customers to 
accomplish an even more impressive feat 
and deliver single-FPGA transponder line 
cards and CFP4 optical modules. 

“Designers looking to migrate to 
CFP4 modules and pack a 400G design 
onto a single FPGA device will need 
UltraScale for multiple reasons,” said 
Saban. “First, it has a high number of 
32G transceivers to interface to CFP4 
optics via next-generation chip-to-chip 
interfaces (CAUI4). Equally important 
is the ability to support 400G of band-
width through the device. Next-gener- 
ation routing and ASIC-like clocking 
support the massive data flow needed 
in 400G systems.” 
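
A rough lane count shows why the transceiver budget matters for a single-chip 400G design. The sketch below assumes the line side is built from 100G ports, each using the four-lane CAUI-4 interface at 25.78125 Gbps per lane; the 32-Gbps transceiver ceiling comes from the quote above, while the function name and structure are hypothetical.

```python
import math

CAUI4_LANES_PER_PORT = 4        # CAUI-4: four electrical lanes per 100G port
CAUI4_LANE_RATE_GBPS = 25.78125  # per-lane line rate, including 64b/66b overhead


def line_side_transceivers(total_gbps: float,
                           transceiver_max_gbps: float = 32.0) -> int:
    """Number of serial transceivers needed to reach `total_gbps` of optics
    when every 100G port is a four-lane CAUI-4 interface."""
    if transceiver_max_gbps < CAUI4_LANE_RATE_GBPS:
        raise ValueError("transceiver cannot run a CAUI-4 lane")
    ports = math.ceil(total_gbps / 100.0)
    return ports * CAUI4_LANES_PER_PORT


if __name__ == "__main__":
    n = line_side_transceivers(400)
    print(n, "transceivers,", n * CAUI4_LANE_RATE_GBPS, "Gbps aggregate line rate")
```

Under these assumptions a 400G line side needs 16 high-speed transceivers before any system-side links are counted.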

In addition, he said, the Virtex 
UltraScale also supports fractional 
PLL, which enables a reduction in 
the number of external voltage-con- 
trolled oscillators required in the de- 
sign. “In UltraScale we can use one 
VCXO and then internally generate 
any of the other frequencies needed,” 
said Saban. “From this you'll enable a 
lower BOM cost and power through 
system integration—just on the line 
card itself, not including the cost re- 
duction and power efficiency gained 
from the CFP4 optics.” 


ULTRASCALE TO A LOWER-COST, 
LOWER-POWER NIC 

With the current boom in cloud com- 
puting applications, IT departments 
demand ever-more-sophisticated, low- 
er-power and lower-cost compute mus- 
cle for their data centers. Xilinx Virtex-7 
XT devices, which featured integrated 
x8 Gen3 blocks, have been a center- 
piece in the network interface card at 
the heart of the most advanced data 
center architectures. 

For network interface cards (NICs), 
the throughput (egress) on the PCI 
Express side must keep up with the 
throughput (ingress) on the Ethernet 
side. “And for a 100G NIC, multiple 
PCIe® Gen3 integrated blocks are needed,”
said Saban. “In the previous generation,
the Virtex-7 XT devices were the
only devices with integrated x8 Gen3 
blocks to meet these requirements.” 
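
The need for more than one block follows from simple link arithmetic. The sketch below assumes the standard PCI Express Gen3 figures of 8 GT/s per lane with 128b/130b encoding and ignores packet-level protocol overhead, so real usable throughput is somewhat lower; the function names are hypothetical.

```python
import math

GEN3_GT_PER_S = 8.0            # raw transfer rate per lane
GEN3_ENCODING = 128.0 / 130.0  # 128b/130b line-coding efficiency


def gen3_link_gbps(lanes: int) -> float:
    """Per-direction bandwidth of a Gen3 link after line coding, in Gbps."""
    return lanes * GEN3_GT_PER_S * GEN3_ENCODING


def blocks_needed(target_gbps: float, lanes_per_block: int = 8) -> int:
    """Minimum number of Gen3 blocks whose combined bandwidth covers the target."""
    return math.ceil(target_gbps / gen3_link_gbps(lanes_per_block))


if __name__ == "__main__":
    print(f"one x8 Gen3 block: {gen3_link_gbps(8):.2f} Gbps")
    print("blocks for 100G:", blocks_needed(100.0))
```

A single x8 Gen3 block tops out at roughly 63 Gbps per direction, so matching 100G of Ethernet ingress takes at least two.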

Now with the UltraScale, it's possible 
to achieve these same performance re- 
quirements with a lower-cost and even 
lower-power Kintex UltraScale FPGA. 
Kintex UltraScale has multiple PCI Ex- 
press Gen3 integrated blocks as well 
as an integrated 100G Ethernet MAC, 
which was a soft IP core in the 7 series. 

“Implementing a NIC in UltraScale 
Kintex not only enables the application 
to be implemented in a midrange device,
but frees up logic for additional,
differentiating packet-processing functions,”
said Saban.


ULTRASCALE TO WIRELESS
COMMUNICATIONS 

In the wireless communications equip- 
ment industry, vendors are in the process 
of simultaneously rolling out LTE and 
LTE Advanced equipment while beginning 
to develop even more forward-looking 
implementations. New systems are sure 
to come down the pike in the form of 
sophisticated architectures that en- 
able multitransmit, multireceive and 
beam-forming functionality. 

Saban said the latest generation of 
beam former equipment at the heart of 
LTE and LTE Advanced systems com- 
monly leverages architectures that rely 
on two Virtex-7 X690T FPGAs, which 
customers chose because of their mix 
of high DSP and BRAM resources. 
Now, with the Kintex UltraScale, Xilinx 
can offer a lower-cost single device that 
will do the same job, Saban said. 

“The Kintex UltraScale offers 40
percent more DSP blocks in a mid- 
range device,” he said. “It also has 
pre-adder squaring and extra accumu- 
lator feedback paths, which enables 
better DSP48 efficiency, folding of 
equations and more efficient compu- 
tation. That means that in UltraScale, 
a two-chip Virtex-class application 
can be reduced to a single Kintex 
KU115 for 48-channel processing.” 
The Kintex UltraScale KU115 has
5,520 DSP blocks and 2,160 BRAMs, 
which Saban called “the highest signal 
processing available in the industry 
and much greater than a GPU.” 

What's more, Xilinx’s Vivado Design 
Suite features a best-in-class high-level 
synthesis tool, Vivado HLS, which Xil- 
inx has co-optimized to help users effi- 
ciently implement complex algorithms 
for beam forming and other math-inten- 
sive applications. 

To learn more about Xilinx’s 20-nm
UltraScale Portfolio, visit
http://www.xilinx.com/products/technology/ultrascale.html.
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datory switchover from analog television to digital 
back in 2009 was uneventful for most people. But for 
communications service providers, it marked a significant 
business growth opportunity. Shortly after the conversion, 
the FCC auctioned offto communications companies, emer- 
> gency services and other entities the portions of the ana- 
log TV broadcast spectrum that weren’t needed for digital 
broadcasts. But after the auctions, there remained 300 MHz 
worth of the radio spectrum (the range of 400 to 700 MHz) 
originally designated for UHF TV channels 20 and beyond, 

a terrain only sparsely used by digital broadcasters today. 
Acknowledging the fact that this prime real estate in the 
spectrum isn’t being used to its full potential, a few years ago 
the FCC began working with a small cadre of tech giants— 
among them Google, Microsoft’s Xbox group, Samsung, 
Dell, Intel and Philips—to find viable ways communications 
companies could share the 400-MHz to 700-MHz spectrum 
with TV broadcasters. The idea was to let them offer new 
mobile services in what's called the “white space,” or unused 
channels between the ones that actually carry broadcasts 
(see Figure 1). Companies like Google and Microsoft want 
to use this white space (and similar white-space spectrum in 
countries around the world) to offer a new breed of commu- 
nication services over what equates to a longer-distance and 
a more signal-robust version of Wi-Fi. And the first equipment 
vendor to field a full commercial receiver and transmission 
_ «system that will enable these new services is Xilinx customer 

“+ Adaptrum (San Jose, Calif... 

“White space is a brand new market,” said Adaptrum found- 
er and CEO Haiyun Tang. “It’s a lot like the Wild West at this 
“point—there is a growing land rush to capitalize on this open 
spectrum in white space, and the standards are still evolving.” 
While the market has emerged relatively recently, 
Adaptrum was quick to identify the opportunity and has 
been developing its white-space communications equip- 
ment for a number of years. Tang, who holds a PhD in 
wireless communications from UC Berkeley and became 
steeped in cognitive-radio technology development in his 
professional career, started Adaptrum with Berkeley pro- 
fessor emeritus Bob Brodersen in 2005, with initial backing 
from an Air Force SBIR grant. Since then, the company has 
made impressive progress and secured follow-up venture 
financing. In 2008, Adaptrum began working with the 
FCC to help it form rules for TV white space and secured 
partnerships with Google, Microsoft and others. In April 
of 2012, it became one of the first equipment vendors to 
have a TV-band white-space device certified by the FCC to 
work in conjunction with the FCC-approved Telcordia database.
Now, the company is moving the technology
from proof of concept to commercialization. In November
of 2013, the FCC certified Adaptrum's ACRS TV white- 
space solution with the Google TV white-space database. 


Figure 1 - In the United States, white space is found in roughly the 400k to 700k range of the spectrum, noted in blue above.
(http://www.ntia.doc.gov/files/ntia/publications/spectrum_wall_chart_aug2011.pdf)


The latest spin of the system is built 
around the Xilinx® Kintex®-7 FPGA.

The ACRS consists of a lunch box-size 
base unit that can mount to traditional 
basestation clusters on poles, sides of 
buildings, mountains and the like, and 
a residential receiver—roughly the size 
of a desktop router or modem—that 
consumers will have in their homes (see 
Figure 2). The system will allow service 
providers to offer connectivity services 
on the order of 20 Mbits over 6 MHz (3
bits per hertz), which isn’t as fast as Wi-Fi
but has a much longer range.

“One of the reasons broadcasters in 
the 1950s selected that portion of the ra- 
dio spectrum for broadcasting was the 
spectrum’s great propagation character- 
istics,” said Tang. “A TV broadcast can 
go through trees, walls and even over 
mountains. Traditional Wi-Fi-based mo- 
bile services have a hard time getting 
past trees, walls and other common ob- 
stacles. If you are right next to a Wi-Fi 
transmitter, you can get 300-Mbit data 
rates, but that data rate can decrease 
dramatically to a few megabits or no 
connection at all when you move away 
from the transmitter and have a few 
walls in between. The 400-MHz to 700- 
MHz spectrum is much more robust. 
Its propagation characteristics allow us 
to offer equipment that has a range of 
up to five miles.” As a result, Tang said, 
“With TV white-space devices using the
400-MHz to 700-MHz spectrum, you'll 
start with 20-Mbit data rates but it will 
be the same for miles instead of feet.” 

Tang said he believes that in areas 
where Wi-Fi and mobile service are 
available, service providers can offer 
the new technology as an additional ser- 
vice to ensure connectivity. But in rural 
areas and undeveloped countries, the 
white-space approach might even
provide primary data services,
especially in geographies where it may 
not be economical to deploy wired net- 
works or satellite broadband. 


COMPLIANCE AND PERFORMANCE 
To create a viable commercial solution, 
Adaptrum has had to figure out some 
complex technical challenges involved 
in using white space that meets the 
regulatory requirements while also en- 
suring good performance at a cost level 
that today’s consumers expect. Adap- 
trum first needed to ensure that the 
bands its system would use to transmit 
and receive signals do not in any way in- 
terfere with licensed digital broadcasts. 
An added complexity is that other parts 
of the 400- to 700-MHz spectrum that 
aren’t being used for broadcast can be 
randomly occupied by other devices, 
such as wireless microphones at sport- 
ing events, churches and nightclubs, 


or by medical equipment such as MRI 
products that use radio waves. 

“The number of channels broadcast- 
ing and the frequency each is using will 
vary day to day and region to region,” 
said Tang. “What’s more, different com- 
panies will have different parts of the 
spectrum available for white-space 
equipment, so the equipment needs to 
be flexible and foolproof.” 

To address these challenges, Tang 
said, white-space equipment vendors are 
developing systems that primarily take 
one of two approaches. The first is what’s 
called a database approach, in which pro- 
viders collect TV transmission data from 
the FCC daily (for transmission in the 
United States) and employ a propagation 
model to determine the coverage contour 
of each tower. Anything outside of that 
contour is fair game for use by the white- 
space communications equipment. 

The second technique, called sensing, 
is an autonomous, on-the-fly approach 
to detecting which channels are avail- 
able. In this scheme, the white-space 
equipment continually senses which 
channels are in use and which are not 
to determine which channels it can use 
for communications. At this point, how- 
ever, Tang said the database approach is 
the only one that has been proven and 
certified by the FCC. “It’s the default ap- 
proach thus far, because the FCC hasn’t 
yet fully tested sensing-based white-
space equipment and proven that it is 
100 percent reliable,” he said. 

Adaptrum’s system uses a database 
approach, in which each morning, each 
basestation downloads an updated da- 
tabase (from the FCC for the U.S. mar- 
ket, for example) of what bands local 
stations will be using for broadcasting 
on a given day. 

The FCC has certified fewer than a
handful of companies as database ser- 
vice providers. These include Telcordia 
(Ericsson), Spectrum Bridge and, most 
recently, Google, which is seemingly 
looking at TV white space as a way to 
make its services available more broadly
worldwide (see http://www.google.org/spectrum/whitespace/channel/).

In addition, there are a number of 
emerging and competing standards that 
strive to define how white space should 
best be used. For example, IEEE 802.22
is trying to define standards for a regional-area
network. “Its goal is to provide
connectivity over a longer range
than conventional Wi-Fi,” said Tang. “Its
advertised range is at least 10 miles, with
a maximum data rate of 20 Mbits
over 6 MHz (3 bits per hertz).” Meanwhile,
a competing IEEE standard,
802.11af, is looking to use the TV white
space for more traditional Wi-Fi operations.
“At this point it is still early and
no one knows for sure what standards 
will be adopted by the market,” said 
Tang. “But there is a lot of excitement 
about this space because of the range 
it provides and the fact that outside of 
metro areas, there is an even greater 
amount of spectrum available.” 

With service providers just now 
forming and standards still being de- 
fined, there are only a scant few com- 
panies—mostly startups—developing 
white-space equipment (transmitters 
and receivers) at this point, but Adap- 
trum seems to be a step ahead of the 
others. One of the reasons for that is 
the company’s use of Xilinx 7 series 
FPGAs. “We are using the Kintex-7 All
Programmable FPGA in our transceiver 
solution because of its flexibility. The 
ability to change the design of the de- 
vice to adapt to changing design speci- 
fications, and to adjust it further as the 
standards get solidified, is really valu- 
able to us,” said Tang. 

Tang said that the flexibility of the 
Kintex-7 FPGA will help in selling the 
system to service providers, as the de- 
vice offers the maximum reprogram- 
mability. That’s because the FPGA can 
be reprogrammed and upgraded even 
after service providers sell the white- 
space communications services to cus- 
tomers. This field upgradability means 
that carriers can add enhancements 
and features on the fly over time. In this 
way, they have the opportunity to “sell 
higher-value plans to customers or add 
more services to existing plans, realiz- 
ing significantly more value out of the 
platform than that of a fixed hardware 
solution,” Tang said.


Figure 2 - The ACRS 2.0 has an aluminum shell construction sealed and ruggedized for outdoor life.
It can be pole-mounted or wall-mounted and is powered over Ethernet.
Vivado’s high-level synthesis features will help you design
a better sorting network for your embedded video app.
A growing number of applications today, from automobiles
to security systems to handheld devices,
employ embedded video capabilities. Each generation
of these products demands more features and better
image quality. However, for some design teams, achieving 
great image quality is not a trivial task. As a field applica- 
tions engineer specializing in DSP design here at Xilinx, 
I’m often asked about IP and methods for effective video
filtering. I’ve found that with the high-level synthesis
(HLS) capabilities in the new Vivado® Design Suite, it’s 
easy to implement a highly effective median-filtering meth- 
od based on a sorting network in any Xilinx® 7 series All 
Programmable device. 

Before we dive into the details of the method, let’s review 
some of the challenges designers face in terms of image in- 
tegrity and the popular filtering techniques they use to solve 
these problems. 

Digital image noise most commonly occurs when a sys- 
tem is acquiring or transmitting an image. The sensor and 
circuitry of a scanner or a digital camera, for example, can 
produce several types of random noise. A random bit error 
in a communication channel or an analog-to-digital convert- 
er error can cause a particularly bothersome type of noise 
called “impulsive noise.” This variety is often called “salt- 
and-pepper noise,” because it appears on a display as ran- 
dom white or black dots on the surface of an image, serious- 
ly degrading image quality (Figure 1). 

To reduce image noise, video engineers typically apply 
spatial filters to their designs. These filters replace or en- 
hance poorly rendered pixels in an image with the appeal- 
ing characteristics or values of the pixels surrounding the 
noisy ones. There are primarily two types of spatial filters: 
linear and nonlinear. The most commonly used linear filter 
is called a mean filter. It replaces each pixel value with the 
mean value of neighboring pixels. In this way, the poorly 
rendered pixels are improved based on the average values 
of the other pixels in the image. Mean filtering uses low- 
pass methods to de-noise images very quickly. However, this 
performance often comes with a side effect: it can blur the 
edge of the overall image. 

In most cases, nonlinear filtering methods are a better 
alternative to linear mean filtering. Nonlinear filtering is 
particularly good at removing impulsive noise. The most 
commonly employed nonlinear filter is called an order-sta- 
tistic filter. And the most popular nonlinear, order-statistic 
filtering method is the median filter. 

Median filters are widely used in video and image processing
because they provide excellent noise reduction with
considerably less blurring than linear smoothing filters of a
similar size. Like a mean filter, a median filter considers each
pixel in the image in turn and looks at its nearby neighbors
to decide whether or not it is representative of its surroundings.
But instead of simply replacing the pixel value with the
mean of neighboring pixel values, the median filter replaces
it with the median of those values. And because the median
value must actually be the value of one of the pixels in the
neighborhood, the median filter does not create new,
unrealistic pixel values when the filter straddles an edge
(bypassing the blur side effect of mean filtering). For this
reason, the median filter is much better at preserving sharp
edges than a linear smoothing filter. This type of filter
calculates the median by first sorting all the pixel values
from the surrounding window into numerical order and then
replacing the pixel being considered with the middle pixel
value (if the neighborhood under consideration contains an
even number of pixels, the average of the two middle pixel
values is used).

Figure 1 - Input image affected by impulsive noise. Only 2 percent of the pixels are corrupted,
but that’s enough to severely degrade the image quality.

Figure 2 - The same image after filtering by a 3x3 median filter; the impulsive noise has been totally removed.

Figure 3 - Block diagram of a sorting network of five input samples. The larger blocks are comparators
(with a latency of one clock cycle) and the small ones are delay elements.

For example, assuming a 3x3 window of pixels centered
around a pixel of value 229, with values

     39   83  225
      5  229  204
    164   61   57

we can rank the pixels to obtain the sorted list
5 39 57 61 83 164 204 225 229.

The median value is therefore the central one, that is, 83.
This number will replace the original value 229 in the
output image. Figure 2 illustrates the effect of a 3x3 median
filter applied to the noisy input image of Figure 1. The
larger the window around the pixel to be filtered, the more
pronounced the filtering effect.

    #define KMED 3 // KMED can be 3, 5, 7, ...
    #define MIN(x,y) ( (x)>(y) ? (y) : (x) )
    #define MAX(x,y) ( (x)>(y) ? (x) : (y) )

    #ifndef GRAY11
    typedef unsigned char pix_t; // 8-bit per pixel
    #else
    #include <ap_int.h>
    typedef ap_int<11> pix_t; // 11-bit per pixel
    #endif

    pix_t median(pix_t window[KMED*KMED])
    {
    #pragma HLS PIPELINE II=1
    #pragma HLS ARRAY_RESHAPE variable=window complete dim=1

        int const N=KMED*KMED;
        pix_t t[N], z[N];
        char i, k, stage;

        // copy input data locally
        for (i=0; i<KMED*KMED; i++) z[i] = window[i];

        // sorting network loop
        for (stage = 1; stage <= N; stage++)
        {
            if ((stage%2)==1) k=0;
            if ((stage%2)==0) k=1;
            for (i = k; i<N-1; i=i+2)
            {
                t[i  ] = MIN(z[i], z[i+1]);
                t[i+1] = MAX(z[i], z[i+1]);
                z[i  ] = t[i  ];
                z[i+1] = t[i+1];
            }
        } // end of sorting network loop

        // the median value is in location N/2+1,
        // but in C the address starts from 0
        return z[N/2];
    } // end of function

Figure 4 - Implementation of a median filter via a sorting network in C
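To make the contrast between mean and median filtering concrete, the following plain-C sketch applies both to the 3x3 example window discussed above. It is an illustration only, not HLS code: the helper names mean9 and median9 are hypothetical, and the median is found with a simple insertion sort rather than with a sorting network.

```c
#include <string.h>

/* Integer mean of a 3x3 window (hypothetical helper, illustration only). */
unsigned char mean9(const unsigned char w[9])
{
    unsigned int sum = 0;
    for (int i = 0; i < 9; i++)
        sum += w[i];
    return (unsigned char)(sum / 9);
}

/* Median of a 3x3 window: sort a local copy, return the middle element. */
unsigned char median9(const unsigned char w[9])
{
    unsigned char s[9];
    memcpy(s, w, sizeof s);
    for (int i = 1; i < 9; i++) {        /* insertion sort of the copy */
        unsigned char key = s[i];
        int j = i - 1;
        while (j >= 0 && s[j] > key) {
            s[j + 1] = s[j];
            j--;
        }
        s[j + 1] = key;
    }
    return s[4];                         /* fifth of nine sorted values */
}
```

For the window {39, 83, 225, 5, 229, 204, 164, 61, 57}, median9 returns 83, a value that actually occurs in the neighborhood, whereas mean9 returns 118, a value that none of the nine pixels holds.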

Thanks to their excellent noise-reduc- 
tion capabilities, median filters are also 
widely used in the interpolation stages 
of scan-rate video conversion systems, 
for example as motion-compensated in- 
terpolators to convert the field rate from 
50 Hz to 100 Hz for interlaced video sig- 
nals, or as edge-oriented interpolators in 
interlaced-to-progressive conversions. 
For a more exhaustive introduction to 
median filters, the interested reader can 
refer to [1] and [2]. 

The most critical step in employing 
a median filter is the ranking method 
you will use to obtain the sorted list of 
pixels for each output pixel to be gen- 
erated. The sorting process can require 
many clock cycles of computation. 

Now that Xilinx offers high-level synthesis in our Vivado
Design Suite, I generally tell people that they can employ
a simple but effective method for designing a median filter
in C language, based on what’s called the sorting-network
concept. We can use Vivado HLS [3] to get real-time
performance on the FPGA fabric of the Zynq®-7000 All
Programmable SoC [4].

For the following lesson, let’s assume that the image format
is 8 bits per pixel, with 1,920 pixels per line and 1,080 lines
per frame at a 60-Hz frame rate, thus leading to a minimum
pixel rate of about 124 MHz. Nevertheless, in order to set
some design challenges, I will ask the Vivado HLS tool for a
200-MHz target clock frequency, being more than happy if I
get something decently larger than 124 MHz (real video
signals also contain blanking data; therefore, the clock rate
is higher than that requested by only the active pixels).

    #define MAX_HEIGHT 1080
    #define MAX_WIDTH 1920
    #define KKMED 1 // KKMED == 1 for 3x3 window

    void ref_median(pix_t in_pix[MAX_HEIGHT][MAX_WIDTH],
                    pix_t out_pix[MAX_HEIGHT][MAX_WIDTH],
                    short int height, short int width)
    {
        short int r, c; // row and col index
        pix_t pix, med, window[KMED*KMED];
        signed char x, y;

        L1:for(r = 0; r < height; r++)
        {
    #pragma HLS LOOP_TRIPCOUNT min=600 max=1080 avg=720
            L2:for(c = 0; c < width; c++)
            {
    #pragma HLS LOOP_TRIPCOUNT min=800 max=1920 avg=1280
    #pragma HLS PIPELINE II=1
                if ( (r>=KMED-1)&&(r< height)&&
                     (c>=KMED-1)&&(c<=width) )
                {
                    for (y=-2; y<=0; y++)
                        for (x=-2; x<=0; x++)
                            window[(2+y)*KMED+(2+x)]=in_pix[r+y][c+x];
                    pix = median(window);
                }
                else
                    pix = 0;

                if(r>0 && c>0)
                    out_pix[r-KKMED][c-KKMED] = pix;
            } // end of L2
        } // end of L1
    } // end of function

Figure 5 - Initial Vivado HLS code, which doesn't
take video line buffer behavior into account
WHAT IS A SORTING NETWORK?
The process of rearranging the ele- 
ments of an array so that they are in 
ascending or descending order is called 
sorting. Sorting is one of the most im- 
portant operations for many embedded 
computing systems. 

Given the huge number of applications in which sorting is
crucial, there are many articles in the scientific literature
that analyze the complexity and speed of well-known sorting
methods, such as bubblesort, shellsort, mergesort and
quicksort. Quicksort is the fastest algorithm for a large
set of data [5], while bubblesort is the simplest. Usually all
these techniques are supposed to run as software tasks on a
RISC CPU, performing only one comparison at a time. Their
workload is not constant but depends on how much the input
data is already partially ordered. For example, given a set
of N samples to be ordered, the computational complexity of
quicksort is assumed to be N², NlogN and NlogN respectively
in the worst-, average- and best-case scenarios. Meanwhile,
for bubblesort, the complexity is N², N², N. I have to admit
that I have not found a uniformity of views about such
complexity figures. But all the articles I’ve read on the
subject seem to agree on one thing: computing the complexity
of a sorting algorithm is not easy. This, in itself, seems
like a good reason to search for an alternative approach.

In image processing, we need deter- 
ministic behavior in the sorting method 
in order to produce the output picture 
at constant throughput. Therefore, 
none of the abovementioned algorithms 
is a good candidate for our FPGA de- 
sign with Vivado HLS. 
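The data-dependent workload of a software sort can be seen in a small plain-C sketch. It is an illustration under stated assumptions, not part of the design: bubble_sort_count is a hypothetical helper implementing bubblesort with the usual early exit, and it returns how many comparisons were actually performed.

```c
/* Bubblesort with early exit; returns the number of comparisons made.
   Hypothetical helper used only to show that the workload varies. */
int bubble_sort_count(unsigned char *a, int n)
{
    int comparisons = 0;
    for (int pass = 0; pass < n - 1; pass++) {
        int swapped = 0;
        for (int i = 0; i < n - 1 - pass; i++) {
            comparisons++;
            if (a[i] > a[i + 1]) {
                unsigned char tmp = a[i];
                a[i] = a[i + 1];
                a[i + 1] = tmp;
                swapped = 1;
            }
        }
        if (!swapped)
            break;                /* already sorted: stop early */
    }
    return comparisons;
}
```

With nine samples, an already sorted input costs 8 comparisons while a reversed input costs 36, so the time per output pixel would change with the image content, which is what a constant-throughput video pipeline cannot tolerate.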

Sorting networks offer a way to achieve a faster running
time through the use of parallel execution. The fundamental
building block of a sorting network is the comparator, a
simple component that can sort two numbers, a and b, and
then output their maximum and minimum respectively to its
top and bottom outputs, performing a swap if necessary. The
advantage of sorting networks over classical sorting
algorithms is that the number of comparators is fixed for a
given number of inputs. Thus, it’s easy to implement a
sorting network in the hardware of the FPGA. Figure 3
illustrates a sorting network for five samples (designed with
Xilinx System Generator [6]). Note that the processing delay
is exactly five clock cycles, independently of the value of
the input samples. Also note that the five parallel output
signals on the right contain the sorted data, with the
maximum at the top and the minimum at the bottom.

Latency (clock cycles): min 2400019, max 10368019; interval: min 2400020, max 10368020.
Loop L1_L2: latency min 2400014, max 10368014; iteration latency 20.

Figure 6 - Vivado HLS performance estimate for the elementary reference median
filter if it were to be used as an effective top function; throughput is far from optimal.

Figure 7 - Vivado HLS performance estimate for the top-level median filter function;
the frame rate is 86.4 Hz, performance even better than what we need.

Implementing a median filter via 
a sorting network in C language is 
straightforward, as illustrated in the 
code in Figure 4. The Vivado HLS di- 
rectives are embedded in the C code 
itself (#pragma HLS). Vivado HLS re- 
quires only two optimization directives 
to generate optimal RTL code. The first 
is to pipeline the whole function with an
initiation interval (II) of 1 in order to
have the output pixel rate equal to the 
FPGA clock rate. The second optimiza- 
tion is to reshape the window of pixels 
into separate registers, thus improving 
the bandwidth by accessing the data in 
parallel all at once. 
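The fixed-structure property can be checked in ordinary software before any synthesis. The sketch below is a plain-C restatement of the odd-even structure used in the Figure 4 loop, without HLS pragmas; the function names comparator and sorting_network are hypothetical, and the returned comparator count is added purely for illustration.

```c
/* Compare-and-swap: after the call, *lo <= *hi. */
static void comparator(unsigned char *lo, unsigned char *hi)
{
    if (*lo > *hi) {
        unsigned char tmp = *lo;
        *lo = *hi;
        *hi = tmp;
    }
}

/* Odd-even transposition network on n values, using n stages.
   Sorts z in ascending order and returns the number of
   comparators applied, which depends only on n. */
int sorting_network(unsigned char *z, int n)
{
    int comparators = 0;
    for (int stage = 1; stage <= n; stage++) {
        int k = (stage % 2 == 1) ? 0 : 1;   /* alternate pair offset */
        for (int i = k; i < n - 1; i += 2) {
            comparator(&z[i], &z[i + 1]);
            comparators++;
        }
    }
    return comparators;
}
```

For nine inputs the function always applies 36 comparators, whether the data arrives sorted, reversed or random, and the median of the 3x3 example window ends up in z[4]. In hardware, the comparators of one stage work in parallel, so the stage count rather than the comparator count sets the latency.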


THE TOP-LEVEL FUNCTION 

An elementary implementation of a median filter is shown
in the code snippet in Figure 5, which we will use as a
reference. The innermost loop is pipelined in order to
produce one output pixel every clock cycle. To generate
a report with latency estimation, we need to tell the
Vivado HLS compiler (with the TRIPCOUNT directive) the
number of possible iterations in the loops L1 and L2, since
they are “unbounded.” That is, the limits of those loops are
the picture height and width, which are unknown at compile
time, assuming the design can work at run-time on image
resolutions below the maximum allowed resolution of
1,920 x 1,080 pixels.
    void top_median(pix_t in_pix[MAX_HEIGHT][MAX_WIDTH],
                    pix_t out_pix[MAX_HEIGHT][MAX_WIDTH],
                    short int height, short int width)
    {
    #pragma HLS INTERFACE ap_vld register port=width
    #pragma HLS INTERFACE ap_vld register port=height
    #pragma HLS INTERFACE ap_fifo depth=2 port=out_pix
    #pragma HLS INTERFACE ap_fifo depth=2 port=in_pix

        short int r, c; //row and col index
        pix_t pix, med, window[KMED*KMED], pixel[KMED];

        static pix_t line_buffer[KMED][MAX_WIDTH];
    #pragma HLS ARRAY_PARTITION variable=line_buffer complete dim=1

        L1:for(r = 0; r < height; r++)
        {
    #pragma HLS LOOP_TRIPCOUNT min=600 max=1080 avg=720
            L2:for(c = 0; c < width; c++)
            {
    #pragma HLS LOOP_TRIPCOUNT min=800 max=1920 avg=1280
    #pragma HLS PIPELINE II=1

                // Line Buffers fill
                for(int i = 0; i < KMED-1; i++)
                {
                    line_buffer[i][c] = line_buffer[i+1][c];
                    pixel[i] = line_buffer[i][c];
                }
                pix = in_pix[r][c];
                pixel[KMED-1]=line_buffer[KMED-1][c]=pix;

                // sliding window
                for(int i = 0; i < KMED; i++)
                    for(int j = 0; j < KMED-1; j++)
                        window[i*KMED+j] = window[i*KMED+j+1];
                for(int i = 0; i < KMED; i++)
                    window[i*KMED+KMED-1] = pixel[i];

                // Median Filter
                med = median(window);

                if ( (r>=KMED-1)&&(r<height)&&
                     (c>=KMED-1)&&(c<=width) )
                    pix = med;
                else
                    pix = 0;

                if(r>0 && c>0)
                    // KKMED == 1 for 3x3 window
                    // KKMED == 2 for 5x5 window
                    out_pix[r-KKMED][c-KKMED] = pix;
            } // end of L2 loop
        } // end of L1 loop
    } // end of function

Figure 8 - New top-level C code accounting for the behavior of video line buffers
In the C code, the window of pixels
to be filtered accesses different rows in 
the image. Therefore, the benefits of us- 
ing memory locality to reduce memory 
bandwidth requirements are limited. 
Although Vivado HLS can synthesize 
such code, the throughput will not be 
optimal, as illustrated in Figure 6. The 
initiation interval of the loop L1_L2
(as a result of the complete unrolling 
of the innermost loop L2, automatical- 
ly done by the HLS compiler) is five 
clock cycles instead of one, thus lead- 
ing to an output data rate that does not 
allow real-time performance. This is 
also clear from the maximum latency of 
the whole function. At a 5-nanosecond 
target clock period, the number of cy- 
cles to compute an output image would 
be 10,368,020, which means a 19.2-Hz 
frame rate instead of 60 Hz. As detailed 
in [7], the Vivado HLS designer must ex- 
plicitly code the behavior of video line 
buffers into the C-language model tar- 
geted for RTL generation, because the 
HLS tool does not automatically insert 
new memories into the user code. 
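The throughput figures in this discussion follow from simple arithmetic, which the plain-C sketch below spells out. The function names pixel_rate_hz and frame_rate_hz are hypothetical helpers introduced here for illustration; the input numbers are the ones quoted in the text.

```c
/* Active pixels per second for a given video format. */
double pixel_rate_hz(int width, int height, double frames_per_second)
{
    return (double)width * (double)height * frames_per_second;
}

/* Frames per second when one frame takes `cycles` clock cycles
   at a clock period of `period_ns` nanoseconds. */
double frame_rate_hz(long cycles, double period_ns)
{
    return 1.0e9 / ((double)cycles * period_ns);
}
```

A 1,920 x 1,080 image at 60 Hz needs 124,416,000 active pixels per second, which is the 124-MHz requirement. At a 5-ns clock period, 10,368,020 cycles per frame gives a little over 19 frames per second, consistent with the figure quoted above and well short of 60 Hz. An initiation interval of 1 would instead need roughly 1,920 x 1,080 = 2,073,600 cycles per frame.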

The new top-level function C code is 
shown in Figure 8. Given the current 
pixel at coordinates (row, column) 
shown as in_pix[r][c], a sliding window
is created around the output pixel
to be filtered at coordinates (r-1, c-1).
In the case of a 3x3 window size, the
result is out_pix[r-1][c-1]. Note that in
the case of window sizes 5x5 or 7x7,
the output pixel coordinates would
be respectively (r-2, c-2) and (r-3, c-3).
The static array line_buffer stores as 
many KMED video lines as the num- 
ber of vertical samples in the median 
filter (three, in this current case), and 
the Vivado HLS compiler automatical- 
ly maps it into one FPGA dual-port 
Block RAM (BRAM) element, due to 
the static C keyword. 
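The line-buffer and sliding-window bookkeeping can be validated in plain C on a tiny image before involving the HLS tool. The sketch below is an assumption-laden illustration: the names WIN, IMG_W, IMG_H and check_sliding_window are hypothetical, the image is only 6 x 5 pixels, and the function simply counts how often the window built from the line buffers differs from the 3x3 neighborhood read directly from the image.

```c
#define WIN   3   /* window size, plays the role of KMED */
#define IMG_W 6   /* tiny test image width  (assumption) */
#define IMG_H 5   /* tiny test image height (assumption) */

/* Streams the image pixel by pixel through line buffers and a sliding
   window, and returns the number of window entries that disagree with
   the neighborhood whose bottom-right corner is the current pixel. */
int check_sliding_window(unsigned char img[IMG_H][IMG_W])
{
    unsigned char line_buffer[WIN][IMG_W] = {{0}};
    unsigned char window[WIN * WIN] = {0};
    unsigned char pixel[WIN];
    int mismatches = 0;

    for (int r = 0; r < IMG_H; r++) {
        for (int c = 0; c < IMG_W; c++) {
            /* shift column c of the line buffers up by one row */
            for (int i = 0; i < WIN - 1; i++) {
                line_buffer[i][c] = line_buffer[i + 1][c];
                pixel[i] = line_buffer[i][c];
            }
            pixel[WIN - 1] = line_buffer[WIN - 1][c] = img[r][c];

            /* shift the window left and append the new column */
            for (int i = 0; i < WIN; i++) {
                for (int j = 0; j < WIN - 1; j++)
                    window[i * WIN + j] = window[i * WIN + j + 1];
                window[i * WIN + WIN - 1] = pixel[i];
            }

            /* once enough rows and columns have arrived, compare */
            if (r >= WIN - 1 && c >= WIN - 1)
                for (int y = 0; y < WIN; y++)
                    for (int x = 0; x < WIN; x++)
                        if (window[y * WIN + x] !=
                            img[r - (WIN - 1) + y][c - (WIN - 1) + x])
                            mismatches++;
        }
    }
    return mismatches;
}
```

A return value of zero shows that each input pixel needs to be read only once: everything else the filter needs is already held in the line buffers and the window registers.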

It takes only a few HLS directives to achieve real-time
performance. The innermost loop L2 is pipelined in order to
produce one output pixel every clock cycle. The input and
output image arrays in_pix and out_pix are mapped as FIFO
streaming interfaces in the RTL. The line_buffer array is
partitioned into KMED separate arrays so that the Vivado HLS
compiler will map each one into a separate dual-port BRAM;
that increases the number of load/store operations due to
the fact that more ports are available (every dual-port BRAM
is capable of two load or store operations per cycle).
Figure 7 shows the Vivado HLS performance estimate report.
The maximum latency is now 2,073,618 clock cycles. At an
estimated clock period of 5.58 ns, that gives us a frame rate
of 86.4 Hz. This is more than what we need! The loop L1_L2
now exhibits an II=1, as we wished. Note the two BRAMs that
are needed to store the KMED line buffer memories.

Vivado HLS Report Comparison

Performance Estimates
                      sol2_med_3x3  sol2_med_5x5  sol2_med_7x7
Timing (ns)
  Target              5.00          5.00          5.00
  Estimated           3.37          3.37          3.37
Latency (clock cycles)
  Latency             8             24            48
  Interval            1             1             1

Utilization Estimates
  BRAM_18K 0
  DSP48E   0
  FF       457
  LUT      656

Figure 9 - Vivado HLS performance estimates comparison for the median filter
function standalone in cases of 3x3, 5x5 and 7x7 window sizes.

Vivado HLS Report Comparison

All Compared Solutions
  sol2_med_3x3_11bit: xc7z020clg484-1
  sol2_med_3x3_k7:    xc7k325tffg900-2

Performance Estimates
                      sol2_med_3x3_11bit  sol2_med_3x3_k7
Timing (ns)
  Target              5.00                2.00
Latency (clock cycles)
  Latency min         480019              480028
  Latency max         2073619             2073628
  Interval min        480020              480029
  Interval max        2073620             2073629

Utilization Estimates
  BRAM_18K 4
  DSP48E   1

Figure 10 - Vivado HLS comparison report for the 3x3 top-level function in case
of either 11 bits on the 7Z02 or 8 bits on a 7K325 device.


ARCHITECTURAL EXPLORATION 
WITH HIGH-LEVEL SYNTHESIS 

One of the best features of Vivado HLS, 
in my opinion, is the creative freedom 
it offers to explore different design ar- 
chitectures and performance trade-offs 
by changing either the tool's optimiza- 
tion directives or the C code itself. Both 
types of operations are very simple and 
are not time-consuming. 

What happens if we need a larger me- 
dian filter window? Let's say we need 5x5 
instead of 3x3. We can just change the 
definition of KMED in the C code from “3” 
to “5” and run Vivado HLS again. Figure 
9 shows the HLS comparison report for 
the synthesis of the median filter routine 
standalone in the case of 3x3, 5x5 and 7x7 
window sizes. In all three cases, the routine
is fully pipelined (II=1) and meets the
target clock period, while the latency is respectively
9, 25 and 49 clock cycles, as one
expects from the sorting network’s behavior.
Clearly, since the amount of data to be
sorted grows from 9 to 25 or even 49, the
resource utilization (flip-flops and lookup
tables) also grows accordingly.
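How quickly the network grows with the window size can be estimated with a few lines of plain C. This is a sketch under the assumption that the network keeps the odd-even structure of Figure 4, with one stage per sample; the helper names network_stages and network_comparators are hypothetical, and actual flip-flop and LUT counts depend on what the synthesis tool does with the design.

```c
/* Number of stages of the odd-even network for a kmed x kmed window
   (kmed assumed odd: 3, 5, 7, ...). */
int network_stages(int kmed)
{
    return kmed * kmed;
}

/* Total comparators: each of the n stages holds (n - 1) / 2 of them
   when n = kmed * kmed is odd. */
int network_comparators(int kmed)
{
    int n = kmed * kmed;
    return n * ((n - 1) / 2);
}
```

Under these assumptions the 3x3, 5x5 and 7x7 windows need 9, 25 and 49 stages, and 36, 300 and 1,176 comparators respectively. The stage counts line up with the latencies quoted above, while the comparator count grows roughly with the square of the number of samples, which is why the logic resources rise so steeply with window size.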

Because the standalone function is 
fully pipelined, the latency of the top-lev- 
el function remains constant, while the 
clock frequency slightly decreases when 
increasing the window size. 
So far we have only discussed using
the Zynq-7000 All Programmable SoC 
as the target device, but with Vivado 
HLS, we can easily try different tar- 
get devices in the same project. For 
example, if we take a Kintex®-7 325T 
and synthesize the same 3x3 median 
filter design, we can get a place-and- 
route resource usage of two BRAMs, 
one DSP48E, 1,323 flip-flops and 705 
lookup tables (LUTs) at a 403-MHz 
clock and data rate, whereas on the 
Zynq SoC, we had two BRAMs, one 
DSP48E, 751 flip-flops and 653 LUTs 
at a 205-MHz clock and data rate. 

Finally, if we want to see the re- 
source utilization of a 3x3 median 
filter working on a gray image of 11 
bits per sample instead of 8 bits, we 
can change the definition of the pix_t 
data type by applying the ap_int C++ 
class, which specifies fixed-point 
numbers of any arbitrary bit width. 
We just need to recompile the proj- 
ect by enabling the C preprocessing 
symbol GRAY11. In this case, the re- 
source usage estimate on the Zynq 
SoC is four BRAMs, one DSP48E, 
1,156 flip-flops and 1,407 LUTs. Fig- 
ure 10 shows the synthesis estima- 
tion reports of these last two cases. 


FEW DAYS OF WORK 

We have also seen how easy it can 
be to generate timing and area esti- 
mations for median filters with dif- 
ferent window sizes and even with 
a different number of bits per pixel. 
In particular, in the case of a 3x3 
(or 5x5) median filter, the RTL that 
Vivado HLS automatically generates 
consumes just a small area on a Zynq 
SoC device (-1 speed grade), with 
an FPGA clock frequency of 206 (or 
188, for the 5x5 version) MHz and 
an effective data rate of 206 (or 188) 
MSPS, after place-and-route. 

The total amount of design time to 
produce these results was only five work- 
ing days, most of them devoted to building 
the MATLAB® and C models rather than 
running the Vivado HLS tool itself; the latter 
task took less than two days of work.
Verifying that your RTL module or FPGA meets its
requirements can be a challenge. But there are
ways to optimize this process to ensure success.

by Adam P. Taylor
Head of Engineering - Systems
Verification of an FPGA or RTL module can be a
time-consuming process as the engineer strives to ensure
the design will function correctly against its requirement
specification and against corner cases that could make the
module go awry. Engineers traditionally achieve this
verification using a testbench, a file that you will devise
to test your design. However, testbenches can be simple
or complicated affairs. Let us have a look at how we can get
the most from our testbench without overcomplicating it.


WHAT IS VERIFICATION? 

Verification ensures the unit under test (UUT) 
meets both the requirements and the specifi- 
cation of the design, and is thus suitable for 
its intended purpose. In many cases a team 
independent of the design team will perform 
verification so as to bring a fresh eye to the 
project. This way, the people who design the 
UUT are not the ones who say it works. 

With the size and complexity of modern 
FPGA designs, ensuring that the UUT per- 
forms against its specification can be a con- 
siderable task. The engineering team must 
therefore determine at the start of the project
what the verification strategy will be. Choices 
can include a mix of the following tactics: 


• Functional simulation only: This technique checks whether the
  design is functionally correct.

• Functional simulation and code coverage: This method checks that,
  along with the functional correctness, all the code within the
  design has been tested.

• Gate-level simulation: This technique likewise verifies the
  functionality of the design. When back-annotated with timing
  information from the final implemented design, this type of
  simulation can take considerable time to perform.

• Static timing analysis: This method analyzes the final design to
  ensure the module achieves its timing performance.

• Formal equivalence checking: Engineers use this technology to
  check the equivalence of netlists against RTL files.

In the quest to gain the maximum benefit
from the processing system within a Xilinx® 
Zynq®-7000 All Programmable SoC, an operat- 
ing system will get you further than a simple 
bare-metal solution. Anyone developing a Zynq SoC 
design has a large number of operating systems to 
choose from, and depending upon the end applica- 
tion you may opt for a real-time version. An RTOS 
is your best choice if you are using the Zynq SoC in 
industrial, military, aerospace or other challenging 
environments where response times and reliable 
performance are required to prevent loss of life or 
injury, or to achieve strict performance goals. 

To get a feel for how best to add an RTOS to 
our Zynq SoC system, we will be using one of the 
most popular real-time operating systems around, 
the µC/OS-III from Micrium. This RTOS or earlier
versions of it have been used on a number of
very exciting systems, including the Mars Curi- 
osity Rover. The latest version is currently in the 
process of being certified for MISRA-C, DO178B 
Level A, SIL3/4 and IEC61508 standards, which 
means it should have a wide appeal to many Zynq 
SoC users. But before going into the implemen- 
tation details, it’s helpful to review the basics of 
real-time operating systems. 


WHAT IS A REAL-TIME OPERATING SYSTEM? 
What makes a real-time operating system different 
from a standard operating system? Well, a real-time 
operating system is deterministic, which means 
that the system responds within a defined deadline. 
This determinism can be important for a number of 
reasons, for instance if the end application is mon- 
itoring an industrial process and has to respond to 
events within a specified period of time, as would 
be the case for an industrial control system. 

RTOSes are further subdivided based upon their 
ability to meet these deadlines. This categorization 
gives rise to three distinct types of RTOS, each of 
which addresses the concept of deadlines differ- 
ently. In the hard RTOS, missing a deadline is seen 
as a system failure. That’s not the case for the firm 
RTOS, where an occasional missed deadline is ac- 
ceptable. In the soft RTOS, meanwhile, missing a 
deadline reduces the usefulness of the results, but 
the system as a whole can tolerate this. 

Real-time operating systems revolve around the 
concept of running tasks (sometimes called pro- 
cesses), each of which performs a required func- 
tion. For example, a task might read in data over 
an interface or perform an operation on that data. 
A simple system may employ just one task, but it 
is more likely for multiple tasks to be 
running on the processor at any one 
time. Switching between these tasks is 
referred to as “context switching,” and 
it requires that the state of the proces- 
sor associated with each task be stored 
and added to the task stack. 

The process of determining which 
task is to run next is controlled by 
the kernel (the core of the RTOS that 
manages input/output requests from 
software and translates them into 
data-processing instructions for the 
central processing unit and function- 
al elements in the processor). Task 
scheduling can be complicated, espe- 
cially if we want to avoid deadlock, a 
state in which two or more tasks lock 
each other out. The two basic meth- 
ods are time sharing and event-driv- 
en. With time sharing, each task gets 
a dedicated time slot on the processor. 
Higher-priority tasks can be allocated 
multiple time slots. This time slicing 
is controlled via a regular interrupt or 
timer, and is often called “round-rob- 
in scheduling.” With an event-driven 
solution, tasks are switched only when
one with a higher priority is required
to run. This is often called “preemptive
scheduling.”
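
The two scheduling approaches described above can be sketched in C as follows. This is a minimal illustration and not µC/OS-III code: the task_t structure, the function names and the convention that a larger number means a higher priority are all assumptions made for the example.

```c
#include <stddef.h>

/* Hypothetical task descriptor; not part of any real RTOS API. */
typedef struct {
    int priority;   /* assumption: larger number = higher priority */
    int ready;      /* nonzero if the task is ready to run */
} task_t;

/* Time sharing (round-robin): pick the next ready task after `current`.
 * Returns the task index, or -1 if no task is ready. */
int next_round_robin(const task_t *tasks, size_t n, size_t current)
{
    for (size_t step = 1; step <= n; ++step) {
        size_t i = (current + step) % n;
        if (tasks[i].ready)
            return (int)i;
    }
    return -1;
}

/* Event-driven (preemptive): pick the ready task with the highest
 * priority. Returns the task index, or -1 if no task is ready. */
int next_preemptive(const task_t *tasks, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; ++i) {
        if (tasks[i].ready &&
            (best < 0 || tasks[i].priority > tasks[best].priority))
            best = (int)i;
    }
    return best;
}
```

A real kernel would call something like next_round_robin() from its timer interrupt and something like next_preemptive() whenever a task becomes ready, and would then perform the context switch.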


DEADLOCKS, RESOURCE 
SHARING AND STARVATION 

When two or more processes need 
to use the same resource—such as a 
UART, ADC or DAC—it is possible for 
them to request this resource at the 
same time. In such a situation, access 
needs to be controlled in order to pre- 
vent contention. How this is managed 
is very important. Without the correct 
management, issues such as “dead- 
lock” or “starvation” might occur, re- 
sulting in system failure. 

Deadlock occurs when a process is 
holding one resource and cannot re- 
lease it, because it is unable to com- 
plete its task. It requires another re- 
source that is currently being held by 
another process. Since the system will 
remain in this state indefinitely, the ap- 
plication is said to be deadlocked. As 
you can imagine, deadlock is a bad sit- 
uation for a real-time operating system 
to find itself in. 


Starvation occurs when a process 
cannot run because the resources it 
needs are always allocated to another 
process. 

It probably won’t surprise you to
hear that much has been written
on these subjects over the years,
and there are many proposed solutions,
such as Dekker’s algorithm, a classic
solution to the mutual-exclusion problem
in concurrent programming. The
most commonly used method to handle 
these kinds of situations is semaphores, 
which are usually of two types—binary 
semaphores and counting semaphores. 

Figure 1 - The directory structure showing the location of the demo files

Typically, each resource has a binary semaphore allocated to it. A
requesting process will wait for the resource to become available
before executing. Once the task is completed, the requesting process
will release the resource. The operations on these semaphores are
commonly known as WAIT and SIGNAL. A process will WAIT on a
semaphore. If the resource is free, the process will then be given
control of that resource and it will run until completed, at which
point it will SIGNAL completion. However, if the resource is already
occupied when the process WAITs on the semaphore, then the process
will be suspended until the resource becomes free. This could occur
as soon as the currently executing process is finished, but there
could be a longer wait if this process is preempted by a process of
higher priority. A special class of binary semaphores called mutexes
(derived from the term “mutual exclusion”) are often used to
prevent priority inversion.
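
The WAIT and SIGNAL behavior can be modeled with a few lines of C. The sketch below is only a single-threaded model meant to show the bookkeeping; the bin_sem_t type and the bsem_wait() and bsem_signal() names are hypothetical, and a real kernel would suspend and resume tasks rather than return a flag.

```c
#include <stdbool.h>

/* Hypothetical binary semaphore guarding one resource, e.g. a UART. */
typedef struct {
    bool taken;    /* true while some process holds the resource */
    int  waiting;  /* number of processes suspended on the semaphore */
} bin_sem_t;

/* WAIT: returns true if the caller now owns the resource, false if
 * the caller would have to be suspended until it is released. */
bool bsem_wait(bin_sem_t *s)
{
    if (!s->taken) {
        s->taken = true;
        return true;
    }
    s->waiting++;
    return false;
}

/* SIGNAL: release the resource. If a process is waiting, ownership
 * passes straight to it (the resource stays taken) and the function
 * returns true; otherwise the resource becomes free. */
bool bsem_signal(bin_sem_t *s)
{
    if (s->waiting > 0) {
        s->waiting--;
        return true;
    }
    s->taken = false;
    return false;
}
```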

Counting semaphores work in the 
same way as binary semaphores, but 
they are used when more than one in- 
stance of a particular type of resource 
is available (for example, data stores). 
As each of the resources is allocated 
to a process, the count is reduced to 
show the number of resources re- 
maining free. When the count gets 
to zero, there are no more resources 
available, and the requesting process 
will be suspended until one of the re- 
sources is released. 
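
A counting semaphore can be modeled in the same spirit. Again this is a hedged sketch rather than any RTOS's actual API: count_sem_t, csem_wait() and csem_signal() are names invented for the example.

```c
/* Hypothetical counting semaphore for a pool of identical resources. */
typedef struct {
    int free_count;  /* instances currently available */
    int total;       /* instances in the pool */
} count_sem_t;

/* WAIT: returns 1 if an instance was allocated, 0 if none is free and
 * the caller would have to be suspended. */
int csem_wait(count_sem_t *s)
{
    if (s->free_count > 0) {
        s->free_count--;
        return 1;
    }
    return 0;
}

/* SIGNAL: return one instance to the pool, never exceeding the total. */
void csem_signal(count_sem_t *s)
{
    if (s->free_count < s->total)
        s->free_count++;
}
```

With a pool of two data stores, two WAIT calls succeed, a third would suspend the caller, and a SIGNAL makes one instance available again.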

Figure 2 - Selecting the operating system

Figure 3 - Selecting the µC/OS-III demo (the template description
lists three tasks: a start task that initializes µC/OS-III and prints
a dot every 100 milliseconds, Task #1 printing '1' every second and
Task #2 printing '2' every 2 seconds; it requires a Xilinx Zynq-7000
board (ZC702) and a UART at 115200 bps)

It is often necessary for processes to communicate with one another.
There are a number of methods that can be employed, the simplest of
which is to use a data store and semaphores as described above. More
complex techniques include message queues. With message queues, when
one process wishes to send information to another process, it POSTs a
message to the queue. When a process wishes to receive a message from
a queue, it PENDs on the queue. Message queues therefore work like a
FIFO (first-in, first-out) memory.
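
The FIFO behavior of a message queue can be illustrated with a small ring buffer. This is a sketch under stated assumptions: the msgq_t type, the capacity of four messages and the msgq_post() and msgq_pend() names are invented for the example, and a real RTOS would block the pending task instead of returning false.

```c
#include <stdbool.h>
#include <stddef.h>

#define MSGQ_CAPACITY 4  /* illustrative queue depth */

/* Hypothetical fixed-size message queue holding integer messages. */
typedef struct {
    int    slots[MSGQ_CAPACITY];
    size_t head;   /* index of the oldest message */
    size_t count;  /* number of messages currently queued */
} msgq_t;

/* POST: append a message. Returns false if the queue is full. */
bool msgq_post(msgq_t *q, int msg)
{
    if (q->count == MSGQ_CAPACITY)
        return false;
    q->slots[(q->head + q->count) % MSGQ_CAPACITY] = msg;
    q->count++;
    return true;
}

/* PEND: remove the oldest message into *out. Returns false if the
 * queue is empty (a real kernel would suspend the caller here). */
bool msgq_pend(msgq_t *q, int *out)
{
    if (q->count == 0)
        return false;
    *out = q->slots[q->head];
    q->head = (q->head + 1) % MSGQ_CAPACITY;
    q->count--;
    return true;
}
```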


THE µC/OS-III OPERATING SYSTEM
Micrium’s µC/OS-III is a preemptive
RTOS, which means it will always
run the task with the highest priority
that is ready to execute. The first
step in adding it to your Zynq SoC
system design is to download the
µC/OS-III RTOS from the Micrium
website. Once you’ve done that, the
installation is very straightforward.
You just need to extract a few ZIP
files into the correct folders (directories)
under your Xilinx installation
on your computer.

Figure 4 - Getting the settings correct


Make sure that you extract the ZIP
file named Zynq-7000-ucosiii-bsp.zip
into your <XILINX>\ISE_DS\EDK\sw\lib\bsp\
folder. You will observe
a number of the other operating systems
under this folder, including the
standalone and the xilkernel. Next,
extract the ZIP file named
Zynq-7000-ucosiii-demo.zip into your
<XILINX>\ISE_DS\EDK\sw\lib\sw_apps\ folder,
as shown in Figure 1. Once again, you
will see a number of other application
demos within this folder.

Having installed these two sets of files, 
we are ready to begin creating our proj- 
ect within the software development kit 
(SDK). We will be using the same base 
hardware that was created before, but 
we will require a new application and 
board support package (BSP), since we 
wish to include the operating system. 

Within the SDK, close all open projects 
except the base hardware design. Next, se- 
lect the File > New > Application Project 
option, give the new project a name and 
select the operating system µC/OS-III (see
Figure 2). Then select the demo application
for µC/OS-III (see Figure 3).

Once you are happy, click the Finish 
button. The application and board sup- 
port package (if you choose that op- 
tion) will be created within the SDK. If 
you have the Auto Build option select- 
ed, you may find that a few errors are 
reported. This is because not all the 
project references are correct yet. To 
set these project references, you need 
to import the demo settings, which you 
will find under the Project > Src > Set- 
tings option. Right-click on this XML 
file and view the Properties. This will 
allow you to select and copy the loca- 
tion of this file, as Figure 4 illustrates. 

Once you have copied this location, 
right-click on the project and select 
Properties. Under the heading C/C++ 
General, select the Paths and Symbols
option. Then select Import Settings and
paste in the location of the settings file. 

It is also important to ensure that the 
repositories are correctly pointing to the 
libraries you added earlier. You can check 
these by selecting Xilinx Tools > Repositories,
which should show the location where you
installed the µC/OS-III BSP earlier.

Since we wish to use the UART to 
output the status of the demo—show- 
ing the initialization being completed 
and the tasks running— you may need 
to set the stdin and stdout to the UART 
under the BSP settings. 

Having performed these actions, 
you will see that the project can now 
be built. However, there will still be 
a few warnings, and if you tried to 
run this project on your hardware 
it would not perform as the demo 
states it should. This is because of a 
warning over undeclared functions.
Including the following statement
within the bsp.c file should correct
this issue.


#include "xil_cache.h"


Once I added this “include” header
file, the project built and ran as expected
on my ZedBoard (see my YouTube video at
http://www.youtube.com/watch?v=URB4LasijrA).


UP AND RUNNING 

Having got the example up and run- 
ning, you now have confidence that 
the RTOS has been implemented correctly
on your system. Now you can
proceed to correctly implement the 
software design on the Zynq SoC. Once 
you have created the software applica- 
tion and the engineering team is ready 
to try it out on the hardware, you can 
create a programming file in exact- 
ly the same way as you would for a 
bare-metal system (see Xcell Journal 
issue 83, “How to Configure Your Zynq 
SoC Bare-Metal Solution,”
http://issuu.com/xcelljournal/docs/xcell_journal_issue_83/40?e=2232228/2101904),
enabling the application with RTOS to
boot and execute from the configuration
memory.

How to Make a Custom XBD File for Xilinx Designs


Creating a custom Xilinx Board
Description file can be a timesaver
and can help keep your design
project on track. It’s not difficult
to develop one for any board
you are designing.


Xcell Journal, First Quarter 2014


FPGA vendors provide a number
of good evaluation and application-specific
boards for those
wishing to assess their FPGAs or even 
as a basis for developing systems. Some- 
times, however, a design project may 
require functionality that isn’t offered 
on an evaluation board or might need a 
smaller end system. In these cases, de- 
sign teams must build a custom board. 

Each Xilinx® evaluation board comes 
with a Xilinx Board Description (XBD) 
file that lists the peripherals, their config- 
uration, control registers and pin locking 
with the FPGA on the board. XBD files 
are very useful in that they keep design 
teams organized and help them formu- 
late the best strategy for current designs 
and even for future designs that they 
may implement on the board. 

Of course, if you are creating a custom 
board you won't have an XBD file from 
Xilinx to fall back on. But it is certainly 
worthwhile to take the time to develop 
an XBD file of your own. A properly pre- 
pared XBD file helps your design team 
manage the project and also streamlines 
your device driver and firmware develop- 
ment. Luckily, with a bit of research and 
effort, you can build a custom XBD file 
for your board without undue difficulty. 
(For those using the Vivado Design Suite, 
Xilinx offers refined XBD capabilities in 
a new utility called Board Manager that 
Xilinx introduced with version 2014.1 of 
the tool suite. For more information, see 
the “Vivado Design Suite User Guide,” 
http://www.xilinx.com/support/documentation/sw_manuals/xilinx2013_3/ug898-vivado-embedded-design.pdf.)

Let’s look at one method for develop- 
ing a custom XBD file. For the purposes 
of this example design, we'll be creating 
our XBD file for a custom board using a 
Virtex®-5 FX30T FPGA. 

The best place to start is with the doc- 
umentation from Xilinx and distributor 
Avnet. Because we'll have to write our 
own XBD file, we will have to do it in the
XBD syntax. Xilinx has documented the
XBD syntax in the Platform Specification
Format Reference Manual (see
http://www.xilinx.com/support/documentation/sw_manuals/xilinx11/psf_rm.pdf).

It's very likely that your custom board 
will require serial communications 
(RS232 and RS422), analog-to-digital 
converters (ADCs), digital-to-analog 
converters (DACs), RAM and flash 
memory. The evaluation boards from 
Xilinx and Avnet will also have these 
peripherals, so finding a board that has 
similar parts and looking at its related 
XBD file will speed up the develop- 
ment of your custom XBD file. 

Each XBD file will have different 
blocks that define the FPGA inter- 
faces the board supports, and each 
block has a list of attributes, param- 
eters and ports. Thus, the first entry 
in the file starts with global attribute 
commands, vendor information, the 
name of the board and its revision 
number, a Web URL for assistance, 
and both short and long descriptions 
of the board. 

The local attribute command of the 
file is defined between a BEGIN-END 
block and expressed in the specific 
format that is available in the Platform
Specification Format Reference Manual.
For this example, we'll use the ISE®
version 12.4 design tools targeting our
Virtex-5 FX30T FPGA, which includes a
hard PowerPC® 440 core in addition to its
configurable logic cells.

Other than the FPGA, a custom 
board will contain different periph- 
erals based on the needs of the de- 
sign, such as a serial communication 
interface (RS232, RS422), ADC, DAC, 
SRAM and so on. You can fulfill mul- 
tiple UART requirements by using 
specialized intellectual property (IP) 
blocks for serial communication. Likewise,
you can use external memory
controller (EMC) IP for interfacing
SRAM with the FPGA, and general-purpose
I/O (GPIO) IP to link the
ADC and DAC with the FPGA.

For our example design, we pre- 
pared our custom XBD file to meet 
functional and device requirements 
as listed in the device datasheet. The 
input clock signal to the FPGA is 20 
MHz. The processor is running at 200
MHz and the Processor Local Bus (PLB) is operating
at 100 MHz. Based on this information, we can be sure
we have maintained local device driver timing. Figure 1
shows a block diagram of the custom hardware.

Figure 1 - Block diagram of custom hardware design

Let us begin with the custom file. It starts with the global 
attribute command and after that the clock signal, which is 
compulsory for all the boards, as follows: 


ATTRIBUTE VENDOR = Xilinx Board FX30T
ATTRIBUTE NAME = Virtex 5 FX30T
ATTRIBUTE REVISION = A
ATTRIBUTE SPEC_URL = www.xilinx.com
ATTRIBUTE CONTACT_INFO_URL = http://www.xilinx.com/support/techsup/tappinfo.htm
ATTRIBUTE DESC = Xilinx Virtex 5 FX30T Custom Platform
ATTRIBUTE LONG_DESC = 'The FX30T board is intended to showcase and demonstrate Virtex-5 technology. This board utilizes Xilinx Virtex 5 XC5VFX30T-FF665 device. The board includes ADC, DAC, RS232, RS422, SRAM, PLATFORM FLASH, CPU Debug (JTAG) and CPU Trace connectors.'


BEGIN IO_INTERFACE
  ATTRIBUTE INSTANCE = clk_1
  ATTRIBUTE IOTYPE = XIL_CLOCK_V1
  PARAMETER CLK_FREQ = 20000000, IO_IS = clk_freq, RANGE = (20000000) # 20 MHz
  PORT USER_SYS_CLK = CLK_20MHZ, IO_IS = ext_clk
END


After this we need to list all the peripherals of the board, 
one by one, in this file. (Readers interested in the detailed 
coding information for each peripheral block may refer to 
the companion PDF titled “Details of XBD Coding for a Custom
Board,” at http://www.xilinx.com/publications/xcellonline/xbd_coding.pdf.)


DIGITAL-TO-ANALOG CONVERTER 

Let us begin with the digital-to-analog converter, the AD7841 
from Analog Devices. This DAC has eight channels, three 
address lines and 14-bit data lines, along with a handful of 
control signals to handle the functioning of the device. It 
interfaces with the FPGA using a GPIO IP core provided by 
Xilinx. Address lines of the device (A0-A2) are connected to 
the processor address lines. 

The DAC has four control signals: LDACN, CSN, WRN and 
CLRN. There are two ways to structure these signals— you 
could either assign one bit to each signal, or directly frame one 
register of 4 bits. It all depends on how you want to handle 
these signals for your application. The DAC data bus is 14 bits wide, D0-D13.
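
If you choose the single 4-bit register, the four control signals can be handled with bit masks. The sketch below is illustrative only: the bit positions and all names are assumptions (the article does not give the actual register layout), and the signals are treated as active low, as the N suffix suggests.

```c
#include <stdint.h>

/* Assumed bit positions inside the 4-bit control register. */
#define DAC_LDACN      (1u << 0)
#define DAC_CSN        (1u << 1)
#define DAC_WRN        (1u << 2)
#define DAC_CLRN       (1u << 3)
#define DAC_CTRL_MASK  0xFu

/* Idle state: all active-low signals deasserted (pulled high). */
uint32_t dac_ctrl_idle(void)
{
    return DAC_CTRL_MASK;
}

/* Assert one or more active-low signals by driving their bits low. */
uint32_t dac_ctrl_assert(uint32_t reg, uint32_t signals)
{
    return (reg & ~signals) & DAC_CTRL_MASK;
}

/* Release one or more signals by driving their bits high again. */
uint32_t dac_ctrl_release(uint32_t reg, uint32_t signals)
{
    return (reg | signals) & DAC_CTRL_MASK;
}
```

The resulting value would then be written to the GPIO data register in a single access, so that signals which must change together do so in the same write.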

Now that we have the details of the first device, the DAC,
listed in our XBD file, we need to turn our attention to writ- 
ing firmware. In developing the device driver, we need to 
go through the datasheet and timing diagram very carefully. 
The timing diagram is shown in Figure 2. 

The device datasheet supplies the timing diagram that the driver
must follow when generating the control signals. We followed timing
parameters t0-t11 exactly as given in the AD7841 datasheet from
Analog Devices. As a first step, set the direction of each signal to
out and then pull it high by writing 1 to the corresponding address.
For example, the direction of the LDACN signal is set to out and then
pulled high with the following statements:


XGpio_WriteReg(XPAR_DAC_14BIT_CONTROL_LDACN_BASEADDR, XGPIO_TRI_OFFSET, 0x0);
//Direction is out

XGpio_WriteReg(XPAR_DAC_14BIT_CONTROL_LDACN_BASEADDR, XGPIO_DATA_OFFSET, 1);
//Pulled high


Delays between signals are achieved by using either the 
“for” loop or, alternatively, the “NOP” instruction. Equally 
important are the sequencing and setup-and-hold timings 
of each signal; you will find these specs in the device data- 
sheet. Here, each increment corresponds to 5 nanoseconds 
when the processor is running at 200 MHz. 
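
The arithmetic behind the 5-ns figure can be captured in a small helper. This is a sketch with hypothetical function names; it assumes the ideal case of one clock cycle per loop increment, whereas a compiled “for” loop normally costs several cycles per iteration, so the real delay should be checked on the hardware.

```c
#include <stdint.h>

/* Number of clock cycles needed to cover at least delay_ns at the
 * given CPU frequency, rounded up. At 200 MHz one cycle is 5 ns. */
uint32_t delay_cycles(uint32_t delay_ns, uint32_t cpu_hz)
{
    uint64_t product = (uint64_t)delay_ns * cpu_hz;
    return (uint32_t)((product + 999999999ull) / 1000000000ull);
}

/* Simple busy-wait. The volatile counter keeps the compiler from
 * optimizing the empty loop away. */
void delay_loop(uint32_t iterations)
{
    for (volatile uint32_t i = 0; i < iterations; ++i) {
        /* intentionally empty */
    }
}
```

For example, a 20-ns setup time at 200 MHz works out to four cycles, so delay_loop(delay_cycles(20, 200000000u)) waits at least that long under the stated assumption.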
Figure 2 - Timing diagram of the AD7841 DAC

Figure 3 - Timing diagram of the AD7891 ADC


COMMUNICATION AND SRAM INTERFACES 

We will use XIL_UART IP from Xilinx for interfacing RS232
and RS422 driver ICs with our Virtex-5 FPGA. We chose two 
Maxim devices—the MAX3079 and MAX3237—for RS422 
and RS232 communication, respectively. We generated the 
control signals of the RS422 IC using the GPIO IP core. 

In terms of memory, we chose static RAM from Cypress, the 
CY7C1061BV33 device, interfacing it with the FPGA using the 
Xilinx External Memory Controller IP core (XIL_EMC). Thanks 
to this IP and the XBD file, interfacing and controlling memo- 
ry was an easy task. We found timing constraints of the device 
in the device datasheet and listed them in the XBD file. As per 
the selected device, you need to update different parameters in 
your XBD file. Because the Xilinx IP is sufficient for handling 
this memory, a separate device driver was not required.


ADC INTERFACING 

In this design, interfacing the ADC with the FPGA was a 
challenge, because all input analog signals, which are coming
from the different sensors, are in the range of ±10 volts.
To meet the functional requirements of our example board, 
we chose the AD7891-1 from Analog Devices. This ADC has 
eight channels and a 12-bit data bus with the option of serial 
and parallel interfaces. In this design, the parallel interface 
proved to be the better option. 

This device operates from a 5-V supply, and since the FPGA
I/O voltage is 3.3 V, our design needed a transceiver for interfacing
with the FPGA. From here on, we will refer to this
transceiver as a buffer. On one side, the buffer is connected
with the FPGA and on the other side, it connects to the
ADC. The FPGA handles the control signals of the buffer. 
You need to carefully handle the direction pin and output 
enable pin of the device to control the data flow from the 
FPGA to the ADC and vice versa. Control signals of the de- 
vice are grouped as a 5-bit register. 

End of conversion (EOCN) of the device should be listed 
separately in your custom XBD file, since this is an import- 
ant signal. EOCN indicates that conversion is completed 
and new data is available to the FPGA for processing. 
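
One common way for firmware to use such a signal is to poll it with a bounded number of attempts. The sketch below is an assumption-laden illustration, not the article's driver: it assumes EOCN is active low and appears in bit 0 of whatever register the caller reads, and it takes the register access as a callback so the logic can be tried on a PC. The fake_eocn_read() function only simulates the pin.

```c
#include <stdbool.h>
#include <stdint.h>

/* Callback that returns the current value of the register holding
 * EOCN. On hardware this would wrap a GPIO register read. */
typedef uint32_t (*read_reg_fn)(void *ctx);

/* Poll EOCN (assumed active low, bit 0) up to max_polls times.
 * Returns true once the conversion is reported complete. */
bool adc_wait_eoc(read_reg_fn read_eocn, void *ctx, uint32_t max_polls)
{
    for (uint32_t i = 0; i < max_polls; ++i) {
        if ((read_eocn(ctx) & 1u) == 0u)
            return true;
    }
    return false;  /* timed out; the caller decides how to recover */
}

/* Simulated EOCN source: stays high for *ctx more reads, then low. */
uint32_t fake_eocn_read(void *ctx)
{
    uint32_t *remaining = (uint32_t *)ctx;
    if (*remaining == 0u)
        return 0u;   /* EOCN low: conversion complete */
    (*remaining)--;
    return 1u;       /* EOCN high: still converting */
}
```

Bounding the number of polls keeps a missing or stuck ADC from hanging the firmware.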


BUFFER INTERFACING 

In this design, we chose a level-shifting transceiver, the 
SN74ALVC164245 from Texas Instruments, as a buffer. It is a 
16-bit noninverting bus transceiver and has two separate ports 
as well as supply rails. Port B of the device is connected with 
the 5-V ADC and Port A with the 3.3-V FPGA. This configuration
allows signal translation from 5 V to 3.3 V and vice versa.

With the direction pin set low, data at Port B is transferred 
to Port A. On high, the opposite will happen, and data from 
Port A will be transferred to Port B. To read the data from the 
ADC, set the output enable (OEN) and direction (DIR) pins 
to low. On a write operation, the FPGA will issue a command 
to the ADC, with the direction pin high and output enable 
pin low. In our example design, the output enable pins of the 
buffer are pulled high in order to avoid bus conflicts. We’ve 
used two buffers and two ADCs in this design. 
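
The direction and output-enable rules above reduce to a small lookup. The following sketch uses invented names (buf_mode_t, buf_pins_t, buf_pins_for) and assumes that the idle state simply disables the outputs; it only computes the pin levels and leaves the actual GPIO writes to the caller.

```c
#include <stdint.h>

/* What the firmware wants the level-shifting buffer to do. */
typedef enum { BUF_IDLE, BUF_READ_ADC, BUF_WRITE_ADC } buf_mode_t;

/* Levels to drive on the buffer's control pins (0 = low, 1 = high). */
typedef struct {
    uint8_t dir;  /* low: Port B -> Port A, high: Port A -> Port B */
    uint8_t oen;  /* low: outputs enabled, high: outputs disabled */
} buf_pins_t;

buf_pins_t buf_pins_for(buf_mode_t mode)
{
    buf_pins_t pins = { 0u, 1u };  /* idle: OEN high avoids bus conflicts */

    switch (mode) {
    case BUF_READ_ADC:             /* ADC (Port B) drives FPGA (Port A) */
        pins.dir = 0u;
        pins.oen = 0u;
        break;
    case BUF_WRITE_ADC:            /* FPGA (Port A) drives ADC (Port B) */
        pins.dir = 1u;
        pins.oen = 0u;
        break;
    case BUF_IDLE:
    default:
        break;
    }
    return pins;
}
```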

Figure 3 is a timing diagram for parallel interface
mode, which is what we used for generating control signals.
Signal timings from t0 to t11 are taken from the
device datasheet from Analog Devices.


Figure 4 - The hardware/firmware development process

You can create or modify this file using any editing program, 
such as Notepad or WordPad. You must save the file with the 
extension “.xbd.” Upon completion, the file has to be located 
in the particular path that will make it visible to the EDK tool 
directly, while creating a new project. When you start a new 
project, you need to select this file for the custom board. 

For example, to make a hypothetical board called “cus- 
tomboard” visible to the EDK tools, you must follow a spe- 
cific directory structure. This is a very important step. Our 
example “customboard_RevX_vX_X_0.xbd,” once created, 
has to be stored in the following directory structure: 


Example Path: 


C:\Xilinx\12.4\ISE_DS\EDK\board\customboard_RevX\data\custom\customboard_RevX_vX_X_0.xbd

In addition, you need to follow certain steps in the development
of hardware to firmware for any application, as
seen in Figure 4. 


FASTER DEVELOPMENT TIME 

It is always a challenging task for an embedded design- 
er making custom hardware to develop device drivers 
for his or her own particular application. Xilinx pro- 
vides extensive literature on the developmental pro- 
cess and a generic board-description file for the de- 
velopment. You can easily tailor this file and modify it 
to meet the needs of your own custom hardware. For 
any modification or change in hardware configuration, 
you will then need to look at only one single file and 
incorporate all the device-related changes into it. In 
this way, the custom XBD file increases productivity 
and accelerates development time.

How to Bring an SMC-Generated Peripheral with AXI4-Lite Interface
into the Xilinx Environment


The SMC Host Interface
Block makes it simple to
integrate a design you've
created with the Synphony
Model Compiler into a Xilinx
embedded platform.


Synphony Model Compiler (SMC) is a
model-based tool from Synopsys that synthesizes
designs created in Simulink® and
MATLAB® to generate optimized RTL for ASIC
and FPGA targets. SMC includes a comprehen- 
sive high-level model library for creating math, 
signal-processing and communications designs in 
the Simulink environment. This library simplifies 
the capture of fixed- and floating-point single-rate 
or multirate algorithms and functionality debug 
in a high-level model-based design environment. 
Using these verified models, the SMC RTL Gen- 
eration Engine automatically creates RTL for 
hardware implementation and rapid exploration 
of multiple architectures for area, performance, 
power and throughput trade-offs. SMC’s high-level 
synthesis engine takes in the top-level design as 
well as MATLAB language input to generate RTL 
optimized for the chosen hardware target. Ad- 
ditionally, SMC automatically generates an RTL 
testbench for the design, along with a bit- and 
cycle-accurate C model and SystemC wrapper to 
enable validation of the generated hardware in a 
SystemC simulation environment. 

In many applications, designers create a pe- 
ripheral to perform some signal-processing 
function and must configure the peripheral via a 
host processor such as the Xilinx® MicroBlaze™ 
soft processor core. The host processor typical- 
ly connects to the peripheral using a standard 
bus interface such as AMBA® AXI4 or AXI4-Lite. 
The SMC library includes a Host Interface Block 
that implements a slave interface to the host 
processor. This Host Interface Block supports 
the AXI4-Lite, APB, Generic Interface and Ava- 
lon-MM bus interface protocol standards. The 
Host Interface Block also implements the neces- 
sary memory-map registers to configure the SMC 
design, including FIR filter coefficients, frequen- 
cy and phase settings of a numerically controlled 
oscillator (NCO) and FFT length of the vari- 
able-length FFT block. The Host Interface Block 
can implement these memory-mapped registers 
at any desired sample rate, including rates
asynchronous to the bus interface clock. You can specify
the bus interface and the memory map settings 
in the Host Interface Block’s UI. Designers can 
use the Host Interface Block to connect an SMC 
design to a bus interconnect or to a bus master. 

Figure 1 - SMC bus interface protocol specification

Let’s take a closer look at how to import and
integrate a peripheral designed with the SMC
Host Interface Block into a Xilinx Embedded
Development Kit (EDK) project. We will also
examine how to simulate the AXI4-Lite
transactions from a MicroBlaze host
processor connected to the peripheral
via a standard bus interconnect. There
are four major steps to this process:


1. Create the peripheral in Simulink
using the IP and the Host Interface
Block, then use the SMC RTL Generation
Engine to generate the optimized RTL
implementation for the design.


2. Import the peripheral into the Xilinx 
EDK project and integrate it with the 
rest of the design. 


3. Develop the software application in 
the SDK. 


4. Generate the RTL and simulate it 
to check for functional correctness 
of the hardware and software. 


STEP 1: CREATE THE PERIPHERAL 
USING THE SMC LIBRARY 

The first thing you will do is to create 
the algorithmic implementation of 
the peripheral using the SMC library 
blocks, and verify the functionality. 
Next, configure the SMC Host Inter- 
face Block based on two factors: the 
configuration data of the algorithmic
part, which defines the memory-map 
parameters; and the system’s inter- 
connect bus protocol, which defines 
the bus interface parameters. Then 
connect the Host Interface Block to 
the algorithmic part of the peripheral. 
Some of the parameters of the Host 
Interface Block (e.g. the bus intercon- 
nect, address width and base address) 
will depend on the platform you are 
targeting. For our example, we have 
chosen a Xilinx Virtex®-7 FPGA as the 
platform and AXI4-Lite as the bus in- 
terface. This platform imposes some 
restrictions on the address width, base 
address and address space for each 
peripheral. The address width must 
be 32 bits, while the base address has 
to be a multiple of 4 kbytes and the 
minimum address space available is 4 
kbytes. Figures 1 and 2 show how the 
bus interface protocol and memory 
map are configured using the Host In- 
terface Block. 
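
The address restrictions described above (32-bit addresses, a base address that is a multiple of 4 kbytes, and an address space of at least 4 kbytes) can be checked with a few lines of C before the values are entered in the Host Interface Block. This is only a minimal sketch: the function names are hypothetical and are not part of SMC or the EDK.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 4-kbyte granularity that the text describes for this platform. */
#define ADDR_GRANULE 0x1000u

/* True if the base address is a multiple of 4 kbytes. */
static bool base_is_aligned(uint32_t base)
{
    return (base % ADDR_GRANULE) == 0u;
}

/* True if the window is aligned, at least 4 kbytes in size and
 * still ends inside the 32-bit address range. */
static bool window_is_valid(uint32_t base, uint32_t size)
{
    if (!base_is_aligned(base) || size < ADDR_GRANULE)
        return false;
    return (size - 1u) <= (UINT32_MAX - base);
}

/* Highest address of the window, i.e. the value for C_HIGHADDR. */
static uint32_t high_address(uint32_t base, uint32_t size)
{
    assert(window_is_valid(base, size));
    return base + (size - 1u);
}
```

For the base address used in this article, `high_address(0x41418000u, 0x1000u)` gives the last byte of a minimum-size 4-kbyte window; a larger register map would use a larger size.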

For ease of integration, though not
mandatory, it is highly recommended
that the bus interface ports in the SMC
model follow the naming conventions
required by the Xilinx EDK. Prefix the
standard AXI4-Lite interface signal
names with “S_AXI_”. For example,
the address signal of the AXI write
address channel (AWADDR) should be
named S_AXI_AWADDR. If the sig- 
nals do not follow the AXI4-Lite nam- 
ing convention, another opportunity 
exists to map the port name to the 
AXI4-Lite signal name while import- 
ing the peripheral into the Xilinx EDK. 
Additionally, do not use capital letters 
in the Simulink model name because 
the EDK does not support peripherals 
with capital letters in their names. 
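
The two naming rules above are simple enough to express in code. The sketch below builds the EDK-style port name from a standard AXI4-Lite signal name and checks that a model name contains no capital letters. Both helper names are hypothetical and serve only to illustrate the convention.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Write "S_AXI_<signal>" into out; returns false if it does not fit. */
static bool edk_port_name(const char *axi_signal, char *out, size_t out_len)
{
    assert(axi_signal != NULL && out != NULL && out_len > 0u);
    int n = snprintf(out, out_len, "S_AXI_%s", axi_signal);
    return n >= 0 && (size_t)n < out_len;
}

/* True if the model name is non-empty and has no capital letters. */
static bool model_name_is_edk_safe(const char *name)
{
    assert(name != NULL);
    for (const char *p = name; *p != '\0'; ++p) {
        if (isupper((unsigned char)*p))
            return false;
    }
    return name[0] != '\0';
}
```

Calling `edk_port_name("AWADDR", buf, sizeof buf)` fills `buf` with `S_AXI_AWADDR`, the name used in the example above.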
Once you've added, configured and 
connected the Host Interface Block, 
generate the RTL for the peripheral us- 
ing SMC’s RTL Generation Engine. Spec- 
ify the target device, implementation pa- 
rameters and optimization constraints in 
the SMC UI to drive the RTL Generation 
Engine to produce optimized hardware 
for your target device. In the top-level RTL
that SMC generated, add two virtual
parameters (generics if your top-level RTL
is VHDL) named “C_BASEADDR” and
“C_HIGHADDR.” Set their default values
to the base address and the maximum
address of your memory-mapped space,
respectively, as required by your IP. This
step is necessary to ensure that the EDK
identifies your peripheral’s memory-mapped
address space. An example of the top-level
Verilog RTL for the SMC-generated design
is shown below, with the two parameters
that you need to add highlighted.


module host_interface_for_edk_top
(
  __list_of_ports_will_be_available_here__
);

parameter C_BASEADDR = 32'h41418000;


STEP 2: IMPORT YOUR
PERIPHERAL INTO XILINX
EDK AND INTEGRATE IT

The next step is to import the periph- 
eral hardware into the EDK’s Xilinx 
Platform Studio (XPS) and make the 
necessary connections (bus interface 
ports as well as functional ports) in 
the system. In our example, we have 
created a basic system with the Micro- 
Blaze processor, Block RAM (BRAM) 
for storing the software executable, 
a Local Memory Bus (LMB), the 
AXI4-Lite interconnect and the Micro- 
Blaze debug module. 

Select the “Create or Import Periph- 
eral” option in the Hardware category of 
the XPS GUI. This will open the Create 
and Import Peripheral wizard. Select 
the “Import existing peripheral” option 
in the wizard. Then, specify the path
where you want the peripheral to be
stored, the design name and the file type
(HDL). Now add all the SMC-generated 
RTL files. Upon successful compilation 
of the RTL, you will need to identify the 
bus interface the peripheral supports— 
namely, the AXI4-Lite slave interface, as 
shown in Figure 3. 

On the next screen in the wizard, select 
the AXI4-Lite ports of the peripheral and 
map them to the standard AXI4-Lite ports 
so that the EDK can connect the bus in- 
terface. If the names of the bus interface 
ports defined in the SMC model match the 
standard bus port names, the EDK will au- 
tomatically map the ports (see Figure 4). 

You may override the automatic map- 
ping if the port names do not match, as 
demonstrated for the AXI4-Lite clock (Clk- 
Div3) and reset (GlobalReset) signals. 

Next, specify the Register Space base
and high address as the C_BASEADDR
and C_HIGHADDR parameters inserted
into the RTL in step 1. Uncheck the
memory space option, because the Host
Interface Block has an addressable
configuration register space. But keep the
default attributes of the RTL parameters
unchanged on the next screen to ensure
a match with the parameters specified in
the Host Interface Block.

Figure 2 - SMC Host Interface Block memory-map parameter settings

Figure 3 - Specifying the bus interface the peripheral supports (here, AXI4-Lite)

Figure 4 - Mapping the RTL ports to the appropriate AXI4-Lite bus interface signal

Figure 5 - The *.mpd file with signal types specified for clock and reset ports

Figure 6 - SMC-generated design connected to the AXI4-Lite bus and MicroBlaze processor


The next screen is titled “Port
Attributes.” Here, you must specify the
clock or reset attributes for any
additional clocks or resets in the
design. Click “Finish” on the next screen
to add the peripheral to the XPS project.
The SMC peripheral has now been
successfully imported into XPS. You
can verify that this has happened by
checking the <project_working_directory>/pcores
folder (XPS creates a directory with the
name of the peripheral here). Browse
through this directory to check that your
RTL files are correctly imported.

First Quarter 2014

XPS will also create a directory
called “data” in parallel with the HDL
directory. This data directory includes
the microprocessor peripheral
description (*.mpd) file, which holds
information about the peripheral’s
parameters and its ports. Check that the
SIGIS = CLK and SIGIS = RST parameters
are defined on the clock and reset ports.
If they are not defined, edit the *.mpd file
and add the definitions manually.
Figure 5 shows an example of this file with
these parameters added.


You will now see your peripheral in 
the “USER” subcategory under Project 
Local PCores in the IP Catalog section 
of the XPS GUI. Right-click on the pe- 
ripheral name and select “ADD IP.” An 
XPS Core Config window will open. Do 
not edit any of the parameters in this 
window—leave the default settings 
unchanged to comply with the Xilinx 
EDK flow. If EDK does not accept the 
specified address space, this indicates 
a conflict in the specified memory 
map with some other peripheral in the 
design. You must go back to SMC, re- 
generate the RTL with a new base ad- 
dress value and repeat the above steps 
to import the SMC peripheral. Click
“OK” in the Core Config window once
the correct base and high address
values are available.

Figure 7 - Clock and reset port connections inserted in the *.mhs file

Figure 8 - Project setup to generate the testbench template and behavioral simulation model
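
The address-space conflict mentioned above comes down to two address ranges overlapping. As a rough illustration of that check, not of how the EDK itself performs it, the sketch below compares two inclusive [base, high] ranges. The `addr_range` type and `ranges_overlap` function are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Inclusive address range, mirroring C_BASEADDR and C_HIGHADDR. */
typedef struct {
    uint32_t base;
    uint32_t high;
} addr_range;

/* True if the two ranges share at least one address. */
static bool ranges_overlap(addr_range a, addr_range b)
{
    assert(a.base <= a.high && b.base <= b.high);
    return a.base <= b.high && b.base <= a.high;
}
```

If the SMC peripheral's range overlaps that of any other peripheral on the interconnect, a new base address has to be chosen in SMC before the RTL is regenerated.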

XPS will open the “Instantiate and 
Connect IP” GUI. You can direct the tool 
to automatically link the peripheral to the 
interconnect bus driven by an available 
processor, or you may choose to man- 
ually connect the peripheral. Once the 
connection is complete, you will see the 
interface connection shown in Figure 6. 

Cross-check that the AXI4-Lite re- 
lated clock and reset are connected 
in the bus interface connection in the 
Graphical Design View tab. If they are 
not automatically connected, edit the 
microprocessor hardware specifica-
tion file <project_name>.mhs in the
<project_working_directory>. Figure 7 
shows the *.mhs file with the clock and 
reset ports added. 

Next, connect the non-AXI4-Lite ports 
of the design using the Ports tab of the 
System Assembly View window. In the Ad- 
dresses tab of the System Assembly View, 
ensure that the peripheral address space is 
visible and the address range is locked. 

Now, export the hardware to the 
Xilinx Software Development Kit by 
selecting the “Export hardware de- 
sign to SDK” option in the Project 
category of the XPS GUI. You need 
not generate the bitstream if you just 


want to run RTL simulation. After 
completion, XPS will create an *.xml 
file that describes the hardware to 
the SDK. This file is normally created 
in the <project_working_directory>/
SDK/SDK_Export/hw folder.


STEP 3: DEVELOP SOFTWARE 
DRIVER USING XILINX SDK 

The next step in the integration is to 
develop the software driver using the 
Xilinx SDK. Launch SDK and create a 
hardware platform specification project
that sources the *.xml file. If you have
selected the “Export and Launch” option
in EDK, the project is created
automatically, and the IP blocks and
address map information from XPS are
available in the SDK project.

Before creating the board support pack- 
age (BSP), you must create driver files for 
your peripheral. A typical driver header 
file must define the memory-mapped reg- 
ister offset address and the prototype to 
read and write to those registers. 
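
A driver header of the kind described above might look like the following minimal sketch. The register names come from this article, but the offsets, the `HIF_` prefix and the function names are assumptions made for illustration; the real offsets are whatever the Host Interface Block memory map defines.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical byte offsets of the first two memory-mapped registers. */
#define HIF_STATUS_REG_OFFSET  0x00u
#define HIF_CONTROL_REG_OFFSET 0x04u

/* Write one 32-bit register at base + offset. */
static inline void hif_write_reg(uintptr_t base, uint32_t offset, uint32_t value)
{
    assert((offset % 4u) == 0u); /* registers are word aligned */
    *(volatile uint32_t *)(base + offset) = value;
}

/* Read one 32-bit register at base + offset. */
static inline uint32_t hif_read_reg(uintptr_t base, uint32_t offset)
{
    assert((offset % 4u) == 0u);
    return *(volatile uint32_t *)(base + offset);
}
```

On the target, `base` would be the peripheral's C_BASEADDR value. In a Xilinx BSP, the `Xil_In32` and `Xil_Out32` helpers from `xil_io.h` play the same role as these two accessors.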

Copy the driver files to the SDK project
repository so that the SDK can identify
the driver, and then create the BSP project.
Open a new, blank application project to
create software that reads data from and
writes data to the peripheral. In this
project, specify the hardware target
platform that was created as the first task
in this step and the BSP that you have just
created. The BSP settings window should
now list the peripheral driver core for the
design.

The application project includes a
main.cc file where the software code for
the application is written. A simple
example is a program that writes values into
the first two registers of the memory map
(the Status_register and the Control_register).
Once this file is created, the SDK
automatically compiles the code and creates
an *.elf file that is used to simulate the
software code in the RTL simulation
environment. This *.elf file can then be used
to verify functionality, as shown in Step 4.
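
The simple application described above could be sketched as follows. This is not the code from the article's SDK project. The register order follows the text (Status_register first, then Control_register), while the function name and the read-back check are assumptions, and a status register in a real design may well be read-only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Word indices of the first two registers in the memory map. */
enum { REG_STATUS = 0, REG_CONTROL = 1 };

/* Write both registers, then read them back to confirm the values. */
static bool write_first_two_registers(volatile uint32_t *regs,
                                      uint32_t status, uint32_t control)
{
    assert(regs != 0);
    regs[REG_STATUS] = status;
    regs[REG_CONTROL] = control;
    return regs[REG_STATUS] == status && regs[REG_CONTROL] == control;
}
```

In the MicroBlaze application, `main` would pass the peripheral's base address, for example `(volatile uint32_t *)0x41418000` in this design. Each store then appears as an AXI4-Lite write transaction in the RTL simulation.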


STEP 4: GENERATE
THE RTL FILES IN XPS

The final step is to generate the RTL 
in XPS and simulate it to verify that the 
hardware and software are function- 
ally correct. In the XPS GUI, choose 
the “Select Elf file” option in the Proj- 
ect category and then “Choose Simu- 
lation Elf file” to specify the path of 
the Elf file that the SDK has created. 
This file is available in the application 
project folder. To create the testbench
template and behavioral simulation
model (see Figure 8), choose the
appropriate settings on the Design Flow
tab of Project Options, in the Project
category of the GUI.

Next, create the HDL files by se- 
lecting “Generate HDL Files” and 
launch the ISE® (ISim) HDL simula- 
tor to check functional correctness. 
Figure 9 shows the AXI4-Lite trans- 
action initiated by the MicroBlaze 
processor for the software C code in 
the SDK project. Note that the first 
two registers of the peripheral (Sta- 
tus_register and Control_register) 
change their value as expected. The 
corresponding AXI4-Lite signals at 
the interface of the peripheral show 
that the SMC-created peripheral is 
successfully integrated in the em- 
bedded project. 



POWERFUL TOOL SET 

It’s a simple matter to use the Host
Interface Block to integrate an SMC-generated
peripheral with the Xilinx embedded
platform using an AXI4-Lite slave
interface. The combination of SMC and
the Xilinx embedded platform makes a
powerful tool set to design and develop
DSP peripherals integrated with a host
processor. The Host Interface Block in
SMC provides the interface necessary
to complete the integration seamlessly
and create powerful solutions for
embedded platforms.

To learn more about SMC, visit
http://www.synopsys.com/Systems/BlockDesign/HLS/Pages/Synphony-Model-Compiler.aspx?cmp=fpga-xcell-S6-smc

Figure 9 - AXI4-Lite transaction initiated by the MicroBlaze processor
XTRA, XTRA


What’s New in the
Vivado 2013.4 Release?


Xilinx is continually improving its products, IP and design tools as it strives to help 
designers work more effectively. Here, we report on the most current updates to Xilinx 
design tools including the Vivado® Design Suite, a revolutionary system- and IP-centric 
design environment built from the ground up to accelerate the design of Xilinx® 

All Programmable devices. For more information about the Vivado Design Suite, 
please visit www.xilinx.com/vivado. 


Product updates offer significant enhancements and new features to the Xilinx design tools. 
Keeping your installation up to date is an easy way to ensure the best results for your design. 


The Vivado Design Suite 2013.4 is available from the Xilinx Download Center at 
www.xilinx.com/download. 


VIVADO DESIGN SUITE
2013.4 RELEASE HIGHLIGHTS

The Vivado Design Suite 2013.4 features support for UltraScale™ devices as
well as significant enhancements to IP Integrator, Vivado HLS, Vivado
synthesis and the Incremental Design Flow.

Device Support

The following devices are production ready:

• Artix®-7 XC7A35T and XC7A50T FPGAs
• Zynq®-7000 XC7Z015 All Programmable SoC

Tandem Configuration for Xilinx PCIe® IP
The following devices have moved to production status:

• Kintex®-7 7K410T FPGA
• Virtex®-7 X550T FPGA


VIVADO DESIGN SUITE:
DESIGN EDITION UPDATES

Vivado IP Integrator

The Vivado Design Suite IP Integrator
supports more than 50 new pieces of IP,
including:

• Connectivity IP
• CPRI™ and JESD204
• GMII to RGMII
• Virtex®-7 PCIe (Gen2 and Gen3)
• RXAUI and XAUI
• 10Gig Ethernet MAC and PCS PMA
• SelectIO™ Wizard

• An entire Block Design (BD) can be
set as an “out-of-context module” to
reduce synthesis times on unchanged
blocks when doing design iterations.

• User IP can now be repackaged after
it has been added to a diagram and all
instances of the IP used in that project
are updated to reflect the changes.

• Support has been added for “remote
sources.” Users need to create a
temporary project to build the initial BD
in a remote location.

• IP Integrator now supports a
“non-project flow” using “read_bd.”

• New designer assistance has been
added around AXI slaves, Block RAM
controllers, Zynq All Programmable
SoC board presets and AXI-Ethernet.

• IP Integrator now supports address
widths between 32 and 64 bits. This is
useful for designing multiported
memory controllers in IP Integrator.

• CTRL-F can now be used to find an IP
or object on the IP Integrator canvas.

• The new “Make Connection” option
simultaneously connects multiple
objects.

• Users can customize AXI4 interface
colors in a diagram based on AXI4
interface types. The default is still to
have all interfaces displayed as the
same color.


Vivado Synthesis 

• Several quality-of-results improvements
target DSP. For example, multiply-accumulate
functions can leverage dynamic opmode
and map fully onto a single DSP block.

• Wide multipliers using more than one
DSP block are improved through better
allocation of pipeline registers.

• FIR filter inference results in
pushbutton QoR (see, for example, the
741-MHz filter described in UG479).


Vivado Implementation Tools 

The Incremental Compile flow silently
ignores the Pblock constraints when
they conflict with reused placement
and honoring the Pblock constraints
would result in worse timing performance.
Better control over Pblock behavior
in the Incremental Compile flow
will be addressed in a future release.
Additional Incremental Compile flow
changes include:

• An automatic incremental reuse report
is issued after read_checkpoint -incremental.

• A new incremental reuse report section
lists conflicts between reused
placement and physical constraints in
the current design.


VIVADO DESIGN SUITE: 
SYSTEM EDITION UPDATES 


Vivado High-Level Synthesis 
Vivado Design Suite 2013.4 HLS updates 
include: 


• Smoother integration of HLS designs
into AXI4 systems is provided through
new data-packing options that automate
the alignment of data to 8-bit
boundaries.

• Enhanced functionality is provided
for AXI4 master interfaces as the user
ports can now be optionally included
in the interface.

• Improved resource usage is provided
for designs using division operations.
These operations now automatically
benefit from smaller implementations.


System Generator for DSP

• System integration of System Generator
for DSP blocks is now faster and easier
with AXI4-Lite slave drivers along with
the existing bare-metal driver support.

• Verification is improved with the
support for non-memory-mapped
interfaces in hardware co-simulation.


LEARN MORE ABOUT THE 
VIVADO DESIGN SUITE 
AND ULTRAFAST DESIGN 
METHODOLOGY 


UltraFast Design Methodology 

In order to further strengthen this of- 
fering and enable accelerated and pre- 
dictable design cycles, Xilinx now de- 
livers the first comprehensive design 
methodology in the programmable 
industry. Xilinx has hand-picked the 
best practices from experts and dis- 
tilled them into an authoritative set of 
methodology guidelines known as the 
UltraFast™ Design Methodology for 
the Vivado Design Suite. 


The UltraFast Design Methodology en- 
ables project managers and engineers 
to accelerate time-to-market and quickly 
tune their sources, constraints and set- 
tings to accurately predict schedules. 
The new “Design Methodology Guide” 
covers all the aspects of: 


• Board and device planning
• Design creation and IP integration
• Implementation and design closure
• Configuration and hardware debug


Vivado QuickTake Tutorials
Vivado Design Suite QuickTake video
tutorials are how-to videos that
take a look inside the features of the
Vivado Design Suite. Topics include
high-level synthesis, simulation and
IP Integrator, among others. New
topics are added regularly.


Vivado Training

For instructor-led training on the Vivado
Design Suite, please visit
www.xilinx.com/training.
XPEDITE


Latest and Greatest from the 
Xilinx Alliance Program Partners 


Xpedite highlights the latest technology updates 
from the Xilinx Alliance partner ecosystem. 


The Xilinx® Alliance Program is a worldwide
ecosystem of qualified companies that
collaborate with Xilinx to further the
development of All Programmable
technologies. Xilinx has built this
ecosystem, leveraging open platforms and
standards, to meet customer needs
and is committed to its long-term
success. Alliance members, including
IP providers, EDA vendors, embedded
software providers, system integrators
and hardware suppliers, help accelerate
your design productivity while minimizing
risk. Here are reports from four of
these members.

AVNET SOFTWARE-DEFINED
RADIO KIT SHOWCASES
ZYNQ SOC


http://www.zedboard.org/product/zynq-sdr-ii-eval


The Zynq SDR-II Evaluation Kit
from Avnet (Phoenix) combines the
ZedBoard with the Analog Devices
AD-FMCOMMS2-EBZ FMC module,
which features the Analog Devices
AD9361 integrated RF Agile Transceiver.
This kit enables a broad range
of transceiver applications for wireless
communications. Tuned to a
narrow RF range in the 2,400-MHz to
2,500-MHz region, the kit is ideal for
the RF engineer seeking optimized
system performance that meets
datasheet specifications in a defined
range of RF spectrum. The kit also
includes four Pulse LTE blade antennas,
the Xilinx Vivado® Design Edition
(device locked to XC7Z020), an
8-Gbyte SD card, a “Getting Started”
card and downloadable documentation
and reference designs.


Avnet's second-generation Zynq SDR kit, the Zynq SDR-II 
Evaluation Kit, combines the ZedBoard with the Analog Devices 
AD-FMCOMMS2-EBZ FMC module, which features the Analog 
Devices AD9361 integrated RF Agile Transceiver. In a more focused 
RF range, the kit's enhanced functionality enables a wide range 
of transceiver applications for wireless communications. 
OMNITEK ZYNQ
SOC BROADCAST
DEVELOPMENT KIT


http://omnitek.tv/xilinx-oz745-dev-platform


The OZ745 Kit from OmniTek (Bas- 
ingstoke, U.K.) incorporates all the 
basic components of hardware, design 
tools, IP and preverified reference de- 
signs needed to rapidly develop video- 
and image-processing designs, along 
with a general board support package. 
The OZ745 board is delivered with an 
evaluation reference design that rec- 
ognizes and displays SDI, HDMI and 
analog video inputs, and displays a test 
pattern on the SDI and HDMI video 
outputs. Also supplied is a demonstra- 
tion of Xilinx’s RTVE 2.1 reference de- 
sign, which has the OmniTek Scalable 
Video Processor IP (OSVP) at its core. 
The reference design deinterlaces and 
resizes four video inputs and compos- 
ites them onto the video output. Con- 
trol software runs on the Zynq® All 
Programmable SoC’s dual-core Cor- 
tex™-A9 processor, which uses a Li- 
nux build with Qt graphics support. An 
OmniTek 2D graphics IP core provides 
graphics acceleration. The control 
software generates a web page that can 
be hosted locally and composited over 
the video on the SDI, HDMI or LVDS 
flat-panel display output. Control is via 
mouse and keyboard. 


INTOPIX’S VISUALLY 
LOSSLESS VIDEO 
COMPRESSION BOASTS 
TINY FOOTPRINT FOR 
ARTIX-7 AND SPARTAN-6 


http://www.intopix.com/uploaded/Download%20Products/intoPIX-TICO%20FLYER_XILINX.pdf


TICO Compression from intoPIX
(Mont-Saint-Guibert, Belgium) is a new
patent-pending visually lossless light
compression technology specifically
designed for the AV industry. This
revolutionary technology's extremely tiny
footprint means it fits in the smallest
Xilinx Artix®-7 and Spartan®-6 devices.
Yet, TICO is robust enough for real-time
operation with no latency.

Up to now, image and video have
been sent or stored uncompressed
in many displays and systems such
as cameras, video servers and
recorders. TICO is a smart upgrade path
for managing higher resolutions (4K,
8K and beyond) and frame rates while
assuring visual quality, keeping power
and bandwidth within budget, and
significantly reducing the complexity
and cost of the system.

TICO is ideal for applications from
HD to Ultra HD including DVRs, video
servers, high-resolution and high-speed
cameras, video-over-IP systems,
surveillance systems and cable extenders. The
technology especially suits applications
that support higher data streams on
existing networks, because it increases
the number of streams in a multistream
configuration, slashes the internal video
bandwidth and associated power
consumption, and reduces the number
of lanes needed to transport a stream in a
display interface.


SILICON SOFTWARE AND 
MVTEC SOFTWARE DELIVER 
SMART, COMPACT VISION 
SYSTEM ON ZYNQ SOC 


http://press.xilinx.com/2013-11-21-Xilinx-All-Programmable-Solutions-for-Smarter-Factories-and-Smarter-Vision-Showcases-at-SPS-IPC-Drives-2013


A pair of German companies—Silicon 
Software (Mannheim) and MVTec Soft- 
ware GmbH (Munich)—have teamed 
up to deliver a demonstration of high- 
speed optical character recognition 
(OCR) systems performing real-time 
silicon device code recognition. The 
solution utilizes the Zynq SoC and the 
HALCON machine vision software 


from MVTec. The companies accom- 
plished hardware acceleration using 
VisualApplets from Silicon Software. 
VisualApplets, a software tool for pro- 
gramming image-processing tasks on 
Xilinx devices, reduces development 
time and design complexity. 

Designs are described functionally using
graphical block diagrams built from
function modules, and are then compiled
into a form that can be synthesized into
executable hardware code. The modules
are organized in thematic libraries and
cover the essential functional areas of
image processing, from basic to complex
operators.

Although knowledge in hardware 
programming is an advantage, Visu- 
alApplets is directed at software de- 
velopers. With the help of ISE®, the 
graphical representation of the hard- 
ware description (VisualApplets de- 
sign) is transformed into a Xilinx-tar- 
geted bitstream that is compiled to a 
VisualApplets hardware applet with 
additional runtime information. De- 
signers can then load this applet to 
the frame grabber via microDisplay or 
SDK instructions. 

HALCON is the comprehensive 
standard software for machine vision 
with an integrated development envi- 
ronment (IDE) that is used worldwide. 
HALCON’s flexible architecture facili- 
tates rapid development of machine-vi- 
sion, medical-imaging and image-anal- 
ysis applications. The architecture 
provides outstanding performance and 
a comprehensive support of multicore 
platforms, SSE2 and AVX, as well as 
GPU acceleration. It serves all indus- 
tries with a library of more than 1,800 
operators for blob analysis, morphol- 
ogy, matching, measuring, identifica- 
tion and 3D vision, to name just a few. 
Contact Silicon Software
(http://www.silicon-software.info/en/) and MVTec
(http://www.mvtec.com/products/) for
more information.
XAMPLES...


Application Notes 


If you want to do a bit more reading about how our 
FPGAs lend themselves to a broad number of applications, 
we recommend these application notes. 


XAPP1179: USING TANDEM CONFIGURATION FOR 
PCIE IN THE KINTEX-7 CONNECTIVITY TRD 
http://www.xilinx.com/support/documentation/application_
notes/xapp1179-tandem-config-pcie.pdf 


The PCI Express® specification requires the PCIe® link to be
ready to connect with a peer within 120 milliseconds after 
power is stable. Meeting this requirement is a challenge for 
large FPGAs using flash memory for configuration due to 
the size of the programming bitstream and the configuration 
rates available. The Tandem Configuration approach from 
Xilinx® is a practical way to reduce FPGA configuration 
time to meet the 120-ms PCIe link-training requirement.

This application note by Sunita Jain, Mrinal Sarmah and 
David Dye shows how to use the Tandem PROM and the 
Tandem PCIe configuration methods with the Kintex®-7 
Connectivity Targeted Reference Design (TRD) running 
on the KC705 evaluation board with a Kintex-7 XC7K325T 
FPGA. The design describes the adjustments made to the 
TRD to accommodate Tandem Configuration. Using this 
approach, the base bitstream size, and therefore the ini- 
tial configuration time, is reduced by more than 85 percent 
when using Tandem PROM and more than 80 percent when 
using Tandem PCIe. 
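
To see why the bitstream size matters for the 120-ms budget, the following is a rough back-of-the-envelope sketch in Python, not taken from the application note. It models configuration time as bitstream bits divided by the configuration throughput and ignores power-on and startup overhead. The bitstream size, bus width and clock rate are illustrative assumptions; check them against the device configuration documentation. The 15 percent figure is the upper bound implied by the "more than 85 percent" reduction quoted above for Tandem PROM.

```python
def config_time_ms(bitstream_bits, bus_width_bits, clock_hz):
    """Idealized configuration time: bits / (bits per clock * clocks per second)."""
    return 1000.0 * bitstream_bits / (bus_width_bits * clock_hz)

# Assumed values for illustration only (not from the application note).
FULL_BITS = 91_548_896   # assumed full bitstream size for a large Kintex-7 device
BUS_WIDTH = 16           # hypothetical parallel flash data width, in bits
CLOCK_HZ = 33e6          # hypothetical configuration clock

full_ms = config_time_ms(FULL_BITS, BUS_WIDTH, CLOCK_HZ)

# Tandem PROM: the first stage is at most 15% of the full bitstream
# (a reduction of "more than 85 percent").
first_stage_ms = config_time_ms(0.15 * FULL_BITS, BUS_WIDTH, CLOCK_HZ)

print(f"full bitstream: {full_ms:.1f} ms, first stage: {first_stage_ms:.1f} ms")
print("full bitstream meets 120 ms:", full_ms <= 120.0)
print("first stage meets 120 ms:", first_stage_ms <= 120.0)
```

Under these assumed numbers the full bitstream misses the deadline while the first stage fits comfortably, which is the effect Tandem Configuration relies on.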


XAPP1184: PIPE MODE SIMULATION USING 
INTEGRATED ENDPOINT PCI EXPRESS BLOCK IN 
GEN2 X8 CONFIGURATION 
http://www.xilinx.com/support/documentation/application_
notes/xapp1184-PIPE-mode-PCIe.pdf


Verifying designs involving high-speed serial protocols such 
as PCI Express can be complex and time-consuming. Many 
verification projects utilize third-party bus functional models 
(BFMs) to reduce the complexity of the verification process 
and to speed up the time spent running the actual simulation.
Gigabit transceivers are a particular problem for verifi-
cation, since the GTs often consume a significant number of 
processor cycles to simulate. For this reason, and because 
GTs typically have little impact on the behavior of the upper 
PCI Express layer functionality, many verification projects 
bypass them for much of their verification and only simulate 
with GTs to validate the design at the end of a project.

The PHY Interface for the PCI Express Architecture 
(PIPE) is a specification for linking the PCI Express block 
and the GTs. This application note by K. Murali Govinda Rao 
and A. V. Anil Kumar provides a way to connect the PIPE 
interface of the PCI X-actor BFM (in root complex mode) 
from Avery Design Systems to the PIPE interface of a Xil- 
inx 7 series FPGA Integrated PCI Express Endpoint Block. 
When configured with the proper options, the Xilinx PCIe 
Endpoint will have PIPE ports at the core’s top level. You 
can connect these ports to the X-actor RC BFM to bypass 
simulating with the GTs. 

While this application note demonstrates specific connec- 
tions to the Avery BFM, it can also serve as a model for how 
to connect other third-party BFMs to the Integrated PCIe 
Endpoint Block through the PIPE interface. PIPE-mode sim- 
ulation is very useful for reducing the simulation time during 
verification of complex PCI Express applications. 


XAPP1097: IMPLEMENTING SMPTE SDI INTERFACES 
WITH ARTIX-7 FPGA GTP TRANSCEIVERS 
http://www.xilinx.com/support/documentation/application_
notes/xapp1097-smpte-sdi-a7-gtp.pdf


The serial digital interface (SDI) family of standards from the 
Society of Motion Picture and Television Engineers (SMPTE) 
is widely used in professional broadcast studios and video 
production centers to carry uncompressed digital video, 
along with embedded ancillary data such as multiple audio 
channels. The Xilinx SMPTE SD/HD/3G-SDI LogiCORE™ 
IP is a generic SDI receive/transmit datapath that does not
have any device-specific control functions. This application 
note provides a module containing control logic to couple 
the Xilinx SDI core with the Artix®-7 FPGA GTP transceiv- 
ers to form a complete SDI interface. Author John Snow de- 
scribes this additional control and interface logic and pro- 
vides the necessary control and interface modules in both 
Verilog and VHDL source code. Also supplied is a wrapper 
file that contains an instance of the control module for the 
GTP transceiver and the SDI core with the necessary con- 
nections between them. This wrapper file simplifies the pro- 
cess of creating an SDI interface. The application note also 
provides two example SDI designs that run on the Artix-7 
FPGA AC701 evaluation board. 


XAPP1094: CTLE ADAPTATION LOGIC FOR 7 SERIES 
FPGA GTX TRANSCEIVERS 
http://www.xilinx.com/support/documentation/application_
notes/xapp1094-ctle-adaptation-gtx.pdf 


The 7 series FPGA GTX receiver in decision feedback equalizer (DFE)
mode contains an automatic gain control (AGC) block and 
a continuous-time linear equalizer (CTLE) block to com- 
pensate for channel loss. Both the AGC block and the CTLE 
wideband gain stages aim to boost frequencies within the 
operating frequency range of the GTX transceiver to opti- 
mize the eye height of the received signal. Although AGC is 
auto-adaptive, the CTLE wideband stage is not. By default, 
the user adjusts the wideband gain by analyzing the channel 
loss. This application note by David Mahashin explains how 
to use a module implemented in the FPGA logic to automatically
adjust CTLE wideband gain. The one-time calibration
occurs after deassertion of GTRXRESET, RXPMARESET 
or RXDFELPMRESET. This feature allows dynamic adjust- 
ment of the equalizer gain without requiring channel-loss 
analysis. The module is included in the 7 series FPGAs’ 
Transceivers Wizard design as an optional feature. 


XAPP1185: ZYNQ-7000 PLATFORM SOFTWARE 
DEVELOPMENT USING THE ARM DS-5 TOOLCHAIN 
http://www.xilinx.com/support/documentation/application_
notes/xapp1185-Zynq-software-development-with-DS-5.pdf


This document offers guidance on using the ARM® Develop- 
ment Studio 5 (DS-5) design suite to develop, build and debug
bare-metal software for the Xilinx Zynq®-7000 All Programmable
SoC, which is based on the ARM Cortex™-A9
processor. Authors Simon George and Prushothaman Pala- 
nichamy walk you through the process, beginning with cre- 
ating a board support package (BSP) for custom hardware 
design within the Xilinx software development kit (SDK), 
then importing that BSP into the DS-5 tools for a build. Next
comes the creation of the first-stage boot loader and finally, 
the debugging of the target configuration for your Zynq SoC 
custom design. 


XAPP1182: SYSTEM MONITORING USING THE 
ZYNQ-7000 AP SOC PROCESSING SYSTEM WITH 
THE XADC AXI INTERFACE 

http://www.xilinx.com/support/documentation/application_
notes/xapp1182_zynq_axi_xadc_mon.pdf


This application note by Mrinal J. Sarmah and Radhey S. Pan- 
dey describes how to use a Xilinx analog-to-digital converter 
(XADC) for system monitoring applications. The XADC 
Wizard IP offers an AXI4-Lite interface that connects to the 
AXI general-purpose port in a Zynq-7000 All Programmable 
SoC processing system to get system control information 
from the XADC. The XADC block provides dedicated alarm 
output signals that trigger based on preset events. A Linux 
application running on the Zynq SoC’s ARM Cortex-A9 CPU 
controls the alarm threshold of the XADC and monitors the 
alarm output. This design also explores the possibility of us- 
ing an external auxiliary channel through the AXI4-Lite inter- 
face and characterizes the maximum signal frequency that 
can be monitored using that interface. 
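
As a small illustration of what setting an alarm threshold involves, here is a hedged Python sketch that converts between a temperature in degrees Celsius and a 12-bit sensor code. It is not taken from the application note. The transfer function used (temperature = code x 503.975 / 4096 - 273.15) is the one commonly cited for the 7 series on-chip temperature sensor; treat it as an assumption and confirm it against the XADC user guide. The function names and the 85-degree threshold are hypothetical.

```python
# Assumed transfer-function constants for the on-chip temperature sensor.
SCALE = 503.975     # assumed scale factor
OFFSET = 273.15     # Kelvin-to-Celsius offset
CODE_RANGE = 4096   # 12-bit code space

def temp_c_to_code(temp_c):
    """Convert a temperature in Celsius to the nearest 12-bit code (hypothetical helper)."""
    code = round((temp_c + OFFSET) * CODE_RANGE / SCALE)
    # Clamp to the valid 12-bit range.
    return max(0, min(CODE_RANGE - 1, code))

def code_to_temp_c(code):
    """Convert a 12-bit code back to a temperature in Celsius (hypothetical helper)."""
    return code * SCALE / CODE_RANGE - OFFSET

# Example: a hypothetical over-temperature alarm threshold of 85 degrees C.
threshold_code = temp_c_to_code(85.0)
print(f"threshold code: 0x{threshold_code:03X}")
print(f"reads back as: {code_to_temp_c(threshold_code):.2f} C")
```

A monitoring application would write a code like this to the alarm threshold register and then compare later readings against it; the register access itself depends on the driver and is left out here.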


XAPP1183: IMPLEMENTING ANALOG
DATA ACQUISITION USING THE ZYNQ-7000
AP SOC PROCESSING SYSTEM WITH
THE XADC AXI INTERFACE
http://www.xilinx.com/support/documentation/application_
notes/xapp1183-zynq-xadc-axi.pdf


In a second application note, the same authors go on to de- 
scribe how the XADC acquires analog data using its dedicat- 
ed Vp/Vn analog input. The design by Mrinal J. Sarmah and 
Radhey S. Pandey implements a use case where the XADC 
out data is transferred directly to system memory using the 
Xilinx direct memory access (DMA) IP. A Linux-based appli- 
cation running on a Zynq-7000 All Programmable SoC pro- 
cessing system reads the buffer from memory. Then, a Lab- 
VIEW-based application GUI gathers the data and performs 
fast Fourier transform (FFT) processing on it to quantify the 
signal-to-noise ratio (SNR) of the XADC out data. 

This design provides a platform for using the AXI4- 
Stream interface of the XADC Wizard IP for analog 
data-acquisition applications. The authors show how to 
use AXI DMA to transfer the XADC samples into proces- 
sor memory without processor intervention. The design 
quantitatively analyzes the XADC performance metric for 
different frequency tones.
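
The FFT-based SNR measurement mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the LabVIEW application from the note: it uses a naive DFT instead of an optimized FFT, assumes a coherently sampled single tone, and generates its own quantized test data instead of reading real XADC samples. All function names and parameters are hypothetical.

```python
import cmath
import math

def dft_power(samples):
    """Power in bins 0..N/2-1 of a naive DFT (O(N^2); the Nyquist bin is omitted for simplicity)."""
    n = len(samples)
    powers = []
    for k in range(n // 2):
        acc = 0j
        for i, x in enumerate(samples):
            acc += x * cmath.exp(-2j * math.pi * k * i / n)
        powers.append(abs(acc) ** 2)
    return powers

def snr_db(samples):
    """Estimate SNR: power in the largest non-DC bin versus all other non-DC bins."""
    powers = dft_power(samples)
    signal_bin = max(range(1, len(powers)), key=lambda k: powers[k])
    signal = powers[signal_bin]
    noise = sum(p for k, p in enumerate(powers) if k not in (0, signal_bin))
    return 10.0 * math.log10(signal / noise)

def quantized_tone(n, cycles, bits):
    """Synthetic test data: a full-scale sine wave rounded to a signed `bits`-bit code."""
    full_scale = 2 ** (bits - 1) - 1
    return [round(full_scale * math.sin(2 * math.pi * cycles * i / n))
            for i in range(n)]

# 256 samples holding exactly 7 cycles, so the tone falls in a single bin.
snr_12 = snr_db(quantized_tone(256, 7, 12))
snr_8 = snr_db(quantized_tone(256, 7, 8))
print(f"12-bit tone: {snr_12:.1f} dB, 8-bit tone: {snr_8:.1f} dB")
```

For an ideal N-bit converter the result should land near the textbook estimate of 6.02 N + 1.76 dB. Real captured data would also need windowing if the tone is not coherently sampled, and harmonics are lumped in with the noise here, so strictly speaking this measures something closer to SINAD.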
DANIEL GUIDERA


XCLAMATIONS! 


Xpress Yourself 
in Our Caption Contest 


Who said tattoos can't be a professional asset? If you've ever wished
you had your latest circuit diagram at your fingertips, inking it on
your arm might be an alternative. Xercise your funny bone
by submitting an engineering- or technology-related caption for this cartoon
showing an engineer admiring a colleague’s work-inspired tats. The image
might inspire a caption like “Nice, right? I got them at a new booth they set up
at the IEEE conference on programmable devices I went to last week.”

Send your entries to xcell@xilinx.com. Include your name, job title, company
affiliation and location, and indicate that you have read the contest rules at
www.xilinx.com/xcellcontest. After due deliberation, we will print the submissions
we like the best in the next issue of Xcell Journal. The winner will receive a
Digilent Zynq Zybo board, featuring the Xilinx® Zynq®-7000 All Programmable SoC
(http://www.xilinx.com/products/boards-and-kits/1-4AZFTE.htm). Two runners-up
will gain notoriety, fame and a cool, Xilinx-branded gift from our swag closet.

The contest begins at 12:01 a.m. Pacific Time on Jan. 17, 2014. All entries
must be received by the sponsor by 5 p.m. PT on April 1, 2014.

So, think ink ... and get writing!


GLENN BABECKI, principal
system engineer II at Comcast Corp.
(Horsham, Pa.), won a shiny new
Avnet ZedBoard with this caption
for the ping-pong cartoon in
Issue 85 of Xcell Journal:

“... and that, my friends, is how
I came up with the concept of
‘multicore shared ping-pong buffers!’”


Congratulations as well to
our two runners-up:

“Pong would have been in 3D from
the beginning if Nolan Bushnell had
had access to FPGAs back then.”

— Wolfgang Friedrich,
electrical engineer, product development,
SMART Technologies
(Calgary, Alberta, Canada)

“While interviewing for the chief
architect position, Jim is asked to
prove his multitasking abilities.”

— Marcel Ursu,
FPGA engineer, AdvancedIO
(Vancouver, British Columbia, Canada)


NO PURCHASE NECESSARY. You must be 18 or older and a resident of the fifty United States, the District of Columbia or Canada (excluding Quebec) to enter. Entries must be entirely original. Contest begins on Jan. 17, 
2014. Entries must be received by sponsor by 5:00 pm Pacific Time (PT) on April 1, 2014. Official rules are available online at www.xilinx.com/xcellcontest. Sponsored by Xilinx, Inc. 2100 Logic Drive, San Jose, CA 95124. 


66 Xcell Journal First Quarter 2014 


Trust Synopsys’ FPGA Synthesis Solutions
to deliver the fastest time-to-market
for your FPGA design


FPGAs keep getting bigger, but your schedule is not. There is no
time to waste on numerous design iterations and long tool runtimes.
Use hierarchical and incremental techniques available in Synopsys’
Synplify® software to bring in your schedule and meet aggressive
performance goals.


To learn more about how Synopsys FPGA design tools accelerate
design bring-up, visit www.synopsys.com/fpgafastturnaround.
Accelerating Innovation


Xilinx Introduces 


The UltraFast™ Design Methodology
for the Vivado® Design Suite


The UltraFast Design Methodology from Xilinx enables


